Abstract: The problem of estimating subjective visual properties from image and video has attracted increasing interest. A subjective 
visual property is useful either on its own (e.g. image and video interestingness) or as an intermediate representation for visual 
recognition (e.g. a relative attribute). Due to its ambiguous nature, annotating the value of a subjective visual property for learning 
a prediction model is challenging. To make the annotation more reliable, recent studies employ crowdsourcing tools to collect pairwise 
comparison labels. However, using crowdsourced data also introduces outliers. Existing methods rely on majority voting to prune the 
annotation outliers/errors. They thus require a large number of pairwise labels to be collected. More importantly, as a local outlier 
detection method, majority voting is ineffective in identifying outliers that can cause global ranking inconsistencies. In this paper, we 
propose a more principled way to identify annotation outliers by formulating the subjective visual property prediction task as a unified 
robust learning to rank problem, tackling both the outlier detection and learning to rank jointly. This differs from existing methods 
in that (1) the proposed method integrates local pairwise comparison labels together to minimise a cost that corresponds to global 
inconsistency of ranking order, and (2) the outlier detection and learning to rank problems are solved jointly. This not only leads to 
better detection of annotation outliers but also enables learning with extremely sparse annotations. 

Index Terms: Subjective visual properties, outlier detection, robust ranking, robust learning to rank, regularisation path 

1 Introduction 

The solutions to many computer vision problems involve the estimation of some visual properties of an image or video, represented as either discrete or continuous variables. For example, scene classification aims to estimate the value of a discrete variable indicating which scene category an image belongs to; for object detection, the task is to estimate a binary variable corresponding to the presence/absence of the object of interest and a set of variables indicating its whereabouts in the image plane (e.g. four variables if the whereabouts are represented as bounding boxes). Most of these visual properties are objective; that is, there is little or no ambiguity in their true values to a human annotator. 

In comparison, the problem of estimating subjective visual properties is much less studied. This class of computer vision problems nevertheless encompasses a variety of important applications. For example, estimating attractiveness [?] from faces would interest social media or online dating websites, and estimating properties of consumer goods such as the shininess of shoes [?] improves customer experiences on online shopping websites. Recently, the problem of automatically predicting whether people would find an image or video interesting has started to receive increasing attention [?], [?], [5]. Interestingness prediction has a number of real-world applications. In particular, since the number of images and videos uploaded to the Internet is growing explosively, people are increasingly relying on image/video recommendation tools to select which ones to view. Given a query, ranking the retrieved data by relevance to the query and by predicted interestingness would improve user satisfaction. Similarly, user stickiness can be increased if a media-sharing website such as YouTube can recommend videos that are both relevant and interesting. Other applications such as web advertising and video summarisation can also benefit. Subjective visual properties such as the above-mentioned ones are useful on their own. But they can also be used as an intermediate representation for other tasks such as visual recognition, e.g. different people can be recognised by how pale their skin complexions are and how chubby their faces are [6]. When used as a semantically meaningful representation, these subjective visual properties are often referred to as relative attributes [?], [?], [?]. 
Learning a model for subjective visual property (SVP) prediction is challenging primarily due to the difficulties in obtaining annotated training data. Specifically, since most SVPs can be represented as continuous variables (e.g. an interestingness/aesthetics/shininess score ranging from 0 to 1, with 1 being the most interesting/aesthetically appealing/shiny), SVP prediction can be cast as a regression problem: the low-level feature values are regressed to the SVP values given a set of training data annotated with their true SVP values. However, since by definition these properties are subjective, different human annotators often struggle to give an absolute value, and as a result the annotations of different people on the same instance can vary hugely. For example, on a scale of 1 to 10, different people will have very different ideas of what a score of 5 means for an image, especially without any common reference point. On the other hand, it is noted that humans can in general more accurately rank a pair of data points in terms of their visual properties [?], [9], e.g. it is easier to judge which of two images is more interesting than to give an absolute interestingness score to each of them. Most existing studies [3], [?], [?] on SVP prediction thus take a learning to rank approach [10], where annotators give comparative labels about pairs of images/videos and the learned model is a ranking function that predicts the SVP value as a ranking score. 

To annotate these pairwise comparisons, crowdsourcing tools such as Amazon Mechanical Turk (AMT) are used, which allow a large number of annotators to collaborate at very low cost. Data annotation based on crowdsourcing has recently become increasingly popular [6], [3], [?], [?] for annotating large-scale datasets. However, this brings about two new problems. (1) Outliers: the crowd is not all trustworthy. It is well known that crowdsourced data are greatly affected by noise and outliers [?], [?], [?], which can be caused by a number of factors. Some workers may be lazy or malicious [14], providing random or wrong annotations either carelessly or intentionally; other outliers are unintentional human errors caused by the ambiguous nature of the data, and are thus unavoidable regardless of how good the attitudes of the workers are. For example, the pairwise ranking for Figure 1(a) depends on the cultural/psychological background of the annotator, i.e. whether s/he is more familiar with, or prefers, the story of the Monkey King or the Cookie Monster (see footnote 1). When we learn the model from labels collected from many people, we essentially aim to learn the consensus, i.e. what most people would agree on. Therefore, if most of the annotators grew up watching Sesame Street and thus consciously or subconsciously consider the Cookie Monster to be more interesting than the Monkey King, their pairwise labels/votes would represent the consensus. In contrast, one annotator who is familiar with the stories in Journey to the West may choose the opposite; his/her label is thus an outlier under the consensus. (2) Sparsity: the number of pairwise comparisons required is much bigger than the number of data points because $n$ instances define an $O(n^2)$ pairwise space. Consequently, even with crowdsourcing tools, the annotation remains sparse, i.e. not all pairs are compared and each pair is only compared a few times. 

To deal with the outlier problem in crowdsourced data, existing studies take a majority voting strategy [?], [3], [?], [?], [16], [17], [18]. That is, a large budget of 5 to 10 times the number of actual annotated pairs required is allocated to obtain multiple annotations for each pair. These annotations are then averaged so as to eliminate label noise. However, the effectiveness of the majority voting strategy is often limited by the sparsity problem: it is typically infeasible to have many annotators for each pair. Furthermore, there is no guarantee that outliers, particularly those caused by unintentional human errors, can be dealt with effectively. This is because majority voting is a strategy based on local consistency detection: when there are contradictory/inconsistent pairwise rankings for a given pair, the pairwise rankings receiving minority votes are eliminated as outliers. However, it has been found that when pairwise local rankings are integrated into a global ranking, it is possible to detect outliers that can cause global inconsistency and yet are locally consistent, i.e. supported by majority votes [19]. Critically, outliers that cause global inconsistency have more significant detrimental effects on learning a ranking function for SVP prediction and thus should be the main focus of an outlier detection method. 

1. This is also known as Halo Effect in Psychology. 

Figure 1. Examples of pairwise comparisons of subjective visual properties. 

In this paper we propose a novel approach to subjective visual property prediction from sparse and noisy pairwise comparison labels collected using crowdsourcing tools. Different from existing approaches, which first remove outliers by majority voting and then apply regression [?] or learning to rank [?], we formulate a unified robust learning to rank (URLR) framework to solve the outlier detection and learning to rank problems jointly. Critically, instead of detecting outliers locally and independently at each pair by majority voting, our outlier detection method operates globally, integrating all local pairwise comparisons together to minimise a cost that corresponds to global inconsistency of ranking order. This enables us to identify those outliers that receive majority votes but cause large global ranking inconsistency and thus should be removed. Furthermore, as a global method that aggregates comparisons across different pairs, our method can operate with as few as one comparison per pair, making it much more robust against the data sparsity problem than the conventional majority voting approach, which aggregates comparisons for each pair in isolation. More specifically, the proposed model generalises a partially penalised LASSO optimisation or Huber-LASSO formulation [20], [?], [?] from a robust statistical ranking formulation to a robust learning to rank model, making it suitable for SVP prediction given unseen images/videos. We also formulate a regularisation path based solution to solve this new formulation efficiently. Extensive experiments are carried out on benchmark datasets including two image and video interestingness datasets [?], [?] and two relative attribute datasets [?]. The results demonstrate that our method significantly outperforms the state-of-the-art alternatives. 

2 Related work 

Subjective visual properties. Subjective visual property prediction covers a large variety of computer vision problems; it is thus beyond the scope of this paper to present an exhaustive review here. Instead we focus mainly on the image/video interestingness prediction problem, which shares many characteristics with other SVP prediction problems such as image quality [23], memorability [24], and aesthetics [?] prediction. 

Predicting image and video interestingness. Early efforts on image interestingness prediction focus on aspects other than interestingness as such, including memorability [24] and aesthetics [?]. These SVPs are related to interestingness but different. For instance, it is found that memorability can have a low correlation with interestingness: people often remember things that they find uninteresting [?]. The work of Gygli et al. [?] is the first systematic study of image interestingness. It shows that three cues contribute the most to interestingness: aesthetics, unusualness/novelty and general preferences, the last of which refers to the fact that people in general find certain types of scenes more interesting than others, for example outdoor-natural vs. indoor-manmade. Different features are then designed to represent these cues as input to a prediction model. In comparison, video interestingness has received much less attention, perhaps because it is even harder to understand its meaning and contributing cues. Liu et al. [25] focus on key frames and so essentially treat it as an image interestingness problem, whilst [?] is the first work that proposes benchmark video interestingness datasets and evaluates different features for video interestingness prediction. 

Most earlier works cast the aesthetics or interestingness prediction problem as a regression problem [?], [?], [?], [25]. However, as discussed before, obtaining an absolute value of interestingness for each data point is too subjective and affected too much by unknown personal preference/social background to be reliable. Therefore, the two most recent studies on image [?] and video [?] interestingness both collect pairwise comparison data by crowdsourcing. Both use majority voting to remove outliers first. After that, the prediction models differ: [?] converts pairwise comparisons into absolute interestingness values and uses a regression model, whilst [?] employs RankSVM [10] to learn a ranking function, with the estimated ranking score of an unseen video used as the interestingness prediction. We compare with both approaches in our experiments and demonstrate that our unified robust learning to rank approach is superior, as its global formulation lets us remove outliers more effectively, even if they correspond to comparisons receiving majority votes. 

Relative attributes. In a broader sense, interestingness can be considered as one type of relative attribute [6]. Attribute-based modelling [26], [27] has gained popularity recently as a way to describe instances and classes at an intermediate level of representation. Attributes are then used for various tasks including N-shot and zero-shot transfer learning. Most previous studies consider binary attributes [26], [27]. Relative attributes [?] were recently proposed to learn a ranking function to predict the relative semantic strength of visual attributes. Instead of the original class-level attribute comparisons in [?], this paper focuses on instance-level comparisons due to the huge intra-class variations in real-world problems. With instance-level pairwise comparisons, relative attributes have been used for interactive image search [?], and semi-supervised [28] or active learning [29], [30] of visual categories. However, no previous work addresses the problem of annotation outliers except [?], which adopts the heuristic majority voting strategy. 

Learning from noisy paired crowdsourced data. Many large-scale computer vision problems rely on human intelligence tasks (HIT) using crowdsourcing services, e.g. AMT (Amazon Mechanical Turk), to collect annotations. Many studies [?], [31], [32], [?] highlight the necessity of validating random or malicious labels/workers and give filtering heuristics for data cleaning. However, these are primarily based on majority voting, which requires a costly volume of redundant annotations and has no theoretical guarantee of solving the outlier and sparsity problems. As a local (per-pair) filtering method, majority voting does not respect global ordering and even risks introducing additional inconsistency due to the well-known Condorcet's paradox in social choice and voting theory [33]. Active learning [34], [29], [30] is another way to circumvent the $O(n^2)$ pairwise labelling space. It actively poses specific requests to annotators and learns from their feedback, rather than from the 'general' pairwise comparisons discussed in this work. Besides paired crowdsourced data, majority voting is more widely used in crowdsourcing where multiple annotators directly label instances, which has attracted much attention in the machine learning community [16], [?], [?], [?]. In contrast, our work focuses on pairwise comparisons, which are relatively easier for annotators when evaluating subjective visual properties [?]. 

Statistical ranking and learning to rank. Statistical ranking has been widely studied in statistics and computer science [35], [?], [?], [?]. However, statistical ranking only concerns the ranking of the observed/training data, not learning ranking functions that predict unseen data. To learn ranking functions for applications such as interestingness prediction, a feature representation of the data points must be used as model input in addition to the local ranking orders. This is addressed in learning to rank, which is widely studied in machine learning [38], [39], [40]. However, existing learning to rank works do not explicitly model and remove outliers for robust learning, a critical issue for learning from crowdsourced data in practice. In this work, for the first time, we study the problem of robust learning to rank given extremely noisy and sparse crowdsourced pairwise labels. We show both theoretically and experimentally that by solving the outlier detection and ranking prediction problems jointly, we achieve better outlier detection than existing statistical ranking methods and better ranking prediction than existing learning to rank methods without outlier detection, such as RankSVM. 

Our contributions are threefold: (1) We propose a novel robust learning to rank method for subjective visual property prediction using noisy and sparse pairwise comparison/ranking labels as training data. (2) For the first time, the problems of detecting outliers and estimating linear ranking models are solved jointly in a unified framework. (3) We demonstrate both theoretically and experimentally that our method is superior to existing majority voting based methods as well as statistical ranking based methods. An earlier and preliminary version of this work is presented in [?], which focused only on the image/video interestingness prediction problem. 

3 Unified Robust Learning to Rank 

3.1 Problem definition 


We aim to learn a subjective visual property (SVP) prediction model from a set of sparse and noisy pairwise comparison labels, each comparison corresponding to a local ranking between a pair of images or videos. Suppose our training set has $N$ data points/instances represented by a feature matrix $\Phi = [\phi_i^T]_{i=1}^{N} \in \mathbb{R}^{N \times d}$, where $\phi_i$ is a $d$-dimensional column low-level feature vector representing instance $i$. The pairwise comparison labels (annotations collected using crowdsourcing tools) can be naturally represented as a directed comparison graph $G = (V, E)$ with a node set $V = \{i\}_{i=1}^{N}$ corresponding to the $N$ instances and an edge set $E = \{e_{ij}\}$ corresponding to the pairwise comparisons. 

The pairwise comparison labels can be provided by multiple annotators. They are dichotomously saved: suppose annotator $a$ gives a pairwise comparison for instances $i$ and $j$ ($i, j \in V$). If $a$ considers that the SVP of instance $i$ is stronger/more than that of $j$, we save $(i, j, y^a_{e_{ij}})$ and set $y^a_{e_{ij}} = 1$. If the opposite is the case, we save $(j, i, y^a_{e_{ji}})$ and set $y^a_{e_{ji}} = 1$. All the pairwise comparisons between instances $i$ and $j$ are then aggregated over all annotators who have cast a vote on this pair; the results are represented as $w_{e_{ij}} = \sum_a [y^a_{e_{ij}} = 1]$, which is the total number of votes on $i$ over $j$ for a specific SVP, where $[\cdot]$ indicates Iverson's bracket notation, and $w_{e_{ji}}$, which is defined similarly. This gives an edge weight vector $w = [w_{e_{ij}}] \in \mathbb{R}^{|E|}$, where $|E|$ is the number of edges. Now the edge set can be represented as $E = \{e_{ij} \mid w_{e_{ij}} > 0\}$ and $w_{e_{ij}} \in \mathbb{R}$ is the weight for the edge $e_{ij}$. In other words, an edge $e_{ij}: i \to j$ exists if $w_{e_{ij}} > 0$. The topology of the graph is denoted by a flag indicator vector $y = [y_{e_{ij}}] \in \mathbb{R}^{|E|}$, where each indicator $y_{e_{ij}} = 1$ indicates that there is an edge from instance $i$ to $j$ regardless of how many votes it carries. Note that all the elements in $y$ have the value 1, and their index $e_{ij}$ gives the corresponding nodes in the graph. 
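The aggregation of raw votes into the edge set, the weight vector $w$ and the flag indicator vector $y$ can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `aggregate_votes`, the tuple layout `(annotator, i, j)` and the toy votes are assumptions made for the example.

```python
from collections import Counter


def aggregate_votes(votes):
    """Aggregate raw votes into a weighted, directed comparison graph.

    votes: iterable of (annotator, i, j), meaning that the annotator judged
    instance i to have a stronger SVP than instance j (hypothetical layout).
    Returns (edges, w, y): the directed edges e_ij with w_ij > 0, their vote
    counts, and the all-ones flag indicator vector.
    """
    counts = Counter((i, j) for _, i, j in votes)
    edges = sorted(counts)             # E = {e_ij | w_ij > 0}
    w = [counts[e] for e in edges]     # w_ij = number of votes for i over j
    y = [1] * len(edges)               # y_ij = 1 for every existing edge
    return edges, w, y


# Hypothetical toy data: three annotators, three instances.
votes = [("a1", 0, 1), ("a2", 0, 1), ("a3", 1, 0), ("a1", 2, 1)]
edges, w, y = aggregate_votes(votes)
print(edges, w, y)
```

Note that a disputed pair produces two edges, one per direction, each carrying its own vote count.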

Given the training data consisting of the feature matrix $\Phi$ and the annotation graph $G$, there are two tasks: 

1) Detecting and removing the outliers in the edge set $E$ of $G$. To this end, we introduce a set of unknown variables $\gamma = [\gamma_{e_{ij}}] \in \mathbb{R}^{|E|}$, where each variable $\gamma_{e_{ij}}$ indicates whether the edge $e_{ij}$ is an outlier. The outlier detection problem thus becomes the problem of estimating $\gamma$. 

2) Estimating a prediction function for SVP. In this work a linear model is considered due to its low computational complexity, that is, given the low-level feature $\phi_x$ of a test instance $x$, we use a linear function $f(x) = \beta^T \phi_x$ to predict its SVP, where $\beta$ is the coefficient weight vector of the low-level feature $\phi_x$. Note that all formulations can be easily updated to use a non-linear function. 

So far in the introduced notation, three vectors share indices: the flag indicator vector $y$, the outlier variable vector $\gamma$ and the edge weight vector $w$. For notational convenience, from now on we use $y_{ij}$, $\gamma_{ij}$ and $w_{ij}$ to replace $y_{e_{ij}}$, $\gamma_{e_{ij}}$ and $w_{e_{ij}}$ respectively. As in most graph based model formulations, we define $C \in \mathbb{R}^{|E| \times N}$ as the incidence matrix of the directed graph $G$, where $C_{e_{ij} i} = -1/1$ if the edge $e_{ij}$ enters/leaves vertex $i$. 
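As a small illustration of the incidence matrix, the sketch below builds $C$ for a toy edge list and shows that multiplying it with a vector of per-instance scores gives the score difference along each edge. The helper name `incidence_matrix`, the edge list and the scores are assumptions made for the example.

```python
import numpy as np


def incidence_matrix(edges, n):
    """C[e, i] = +1 if edge e leaves vertex i, -1 if it enters vertex i."""
    C = np.zeros((len(edges), n))
    for row, (i, j) in enumerate(edges):   # edge e_ij points from i to j
        C[row, i] = 1.0
        C[row, j] = -1.0
    return C


edges = [(0, 1), (1, 0), (2, 1)]           # hypothetical comparison graph
C = incidence_matrix(edges, n=3)
scores = np.array([0.9, 0.2, 0.5])         # hypothetical ranking scores
print(C @ scores)                          # score_i - score_j for each edge
```

With scores of the form $\Phi\beta$, the product $C\Phi\beta$ therefore stacks the differences $\beta^T\phi_i - \beta^T\phi_j$ over all edges.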

Note that in an ideal case, one hopes that the votes received on each pair are unanimous, e.g. $w_{ij} > 0$ and $w_{ji} = 0$; but often there are disagreements, i.e. we have both $w_{ij} > 0$ and $w_{ji} > 0$. Assuming both cannot be true simultaneously, one of them will be an outlier. In this case, one is the majority and the other the minority, which will be pruned by the majority voting method. This is why majority voting is a local outlier detection method and requires as many votes per pair as possible to be effective (the wisdom of a crowd). 
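The local nature of majority voting can be made concrete with a short sketch. It assumes the votes are stored as a dictionary from directed pairs to vote counts and that ties are dropped in both directions; both choices are assumptions for illustration, since the text does not specify how ties are handled.

```python
def majority_vote(weights):
    """Keep, for each compared pair, only the direction with strictly more votes.

    weights: dict {(i, j): number of votes for i over j}.
    Tied pairs are dropped in both directions (an assumption of this sketch).
    """
    kept = {}
    for (i, j), w_ij in weights.items():
        w_ji = weights.get((j, i), 0)      # votes for the opposite direction
        if w_ij > w_ji:
            kept[(i, j)] = w_ij
    return kept


weights = {(0, 1): 2, (1, 0): 1, (2, 1): 1}   # hypothetical vote counts
print(majority_vote(weights))
```

The edge `(2, 1)` survives with a single unopposed vote: each decision looks at one pair only, so a pair compared once is never questioned.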

3.2 Framework formulation 

In contrast to majority voting, we propose to prune outliers globally and jointly with learning the SVP prediction function. To this end, the outlier variables $\gamma_{ij}$ for outlier detection and the coefficient weight vector $\beta$ for SVP prediction are estimated in a unified framework. Specifically, for each edge $e_{ij} \in E$, its corresponding flag indicator $y_{ij}$ is modelled as 

$$y_{ij} = \beta^T \phi_i - \beta^T \phi_j + \gamma_{ij} + \varepsilon_{ij}, \qquad (1)$$

where $\varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)$ is the Gaussian noise with zero mean and variance $\sigma^2$, and the outlier variable $\gamma_{ij} \in \mathbb{R}$ is assumed to have a higher magnitude than $\sigma$. For an edge $e_{ij}$, if $y_{ij}$ is not an outlier, we expect that $\beta^T \phi_i - \beta^T \phi_j$ should be approximately equal to $y_{ij}$, and therefore we have $\gamma_{ij} = 0$. On the contrary, when the prediction $\beta^T \phi_i - \beta^T \phi_j$ differs greatly from $y_{ij}$, we can explain $y_{ij}$ as an outlier and compensate for the discrepancy between the prediction and the annotation with a nonzero value of $\gamma_{ij}$. The only prior knowledge we have on $\gamma_{ij}$ is that it is a sparse variable, i.e. in most cases $\gamma_{ij} = 0$. 

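For the whole training set, Eq (1) can be re-written in its matrix form 

$$y = C\Phi\beta + \gamma + \varepsilon, \qquad (2)$$

where $y = [y_{ij}] \in \mathbb{R}^{|E|}$, $\gamma = [\gamma_{ij}] \in \mathbb{R}^{|E|}$, $\varepsilon = [\varepsilon_{ij}] \in \mathbb{R}^{|E|}$ and $C \in \mathbb{R}^{|E| \times N}$ is the incidence matrix of the annotation graph $G$. 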

In order to estimate the $|E| + d$ unknown parameters ($|E|$ for $\gamma$ and $d$ for $\beta$), we aim to minimise the discrepancy between the annotation $y$ and our prediction $C\Phi\beta + \gamma$, while keeping the outlier estimation $\gamma$ sparse. Note that $y$ only contains information about which pairs of instances have received votes, but not how many. The discrepancy thus needs to be weighted by the number of votes received, represented by the edge weight vector $w = [w_{ij}] \in \mathbb{R}^{|E|}$. To that end, we put a weighted $\ell_2$-loss on the discrepancy and a sparsity enhancing penalty on the outlier variables. This gives us the following cost function: 

$$L(\beta, \gamma) = \frac{1}{2}\|y - C\Phi\beta - \gamma\|_{2,w}^2 + p_\lambda(\gamma), \qquad (3)$$

where 

$$\|y - C\Phi\beta - \gamma\|_{2,w}^2 = \sum_{e_{ij} \in E} w_{ij}\,(y_{ij} - \gamma_{ij} - \beta^T\phi_i + \beta^T\phi_j)^2$$

and $p_\lambda(\gamma)$ is the sparsity constraint on $\gamma$. With this cost function, our Unified Robust Learning to Rank (URLR) framework identifies outliers globally by integrating all local pairwise comparisons together. Note that in Eq (3) the noise term $\varepsilon$ has been removed because the discrepancy is mainly caused by outliers due to their larger magnitude. 

Ideally the sparsity enhancing penalty term $p_\lambda(\gamma)$ should be an $\ell_0$ regularisation term. However, for a tractable solution, an $\ell_1$ regularisation term is used: $p_\lambda(\gamma) = \lambda\|\gamma\|_{1,w} = \lambda \sum_{e_{ij} \in E} w_{ij}|\gamma_{ij}|$, where $\lambda$ is a free parameter corresponding to the weight for the regularisation term. With this $\ell_1$ penalty term, the cost function becomes convex: 

$$L(\beta, \gamma) = \frac{1}{2}\|\sqrt{W}(y - \gamma) - X\beta\|_2^2 + \lambda\|\gamma\|_{1,w}, \qquad (4)$$

where $X = \sqrt{W} C\Phi$, $W = \mathrm{diag}(w)$ is the diagonal matrix of $w$ and $\sqrt{W} = \mathrm{diag}(\sqrt{w})$. 
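To make the terms of Eq (4) concrete, the following sketch evaluates the cost for given $\beta$ and $\gamma$. It is an illustrative helper only: the function name `urlr_cost`, the NumPy array layout and the toy values are assumptions.

```python
import numpy as np


def urlr_cost(beta, gamma, y, w, C, Phi, lam):
    """Evaluate 0.5 * ||sqrt(W)(y - gamma) - X beta||_2^2 + lam * ||gamma||_{1,w}."""
    W_sqrt = np.sqrt(w)                          # diagonal of sqrt(W)
    X = W_sqrt[:, None] * (C @ Phi)              # X = sqrt(W) C Phi
    residual = W_sqrt * (y - gamma) - X @ beta   # weighted discrepancy
    penalty = lam * np.sum(w * np.abs(gamma))    # vote-weighted l1 norm
    return 0.5 * residual @ residual + penalty


# Hypothetical toy problem: two instances, one edge 1 -> 0 carrying two votes.
Phi = np.array([[0.0], [1.0]])
C = np.array([[-1.0, 1.0]])
w = np.array([2.0])
y = np.array([1.0])
print(urlr_cost(np.array([1.0]), np.array([0.0]), y, w, C, Phi, lam=1.0))
print(urlr_cost(np.array([1.0]), np.array([0.5]), y, w, C, Phi, lam=1.0))
```

In the first call the prediction matches the annotation and no outlier is declared, so the cost is zero; in the second, a nonzero $\gamma$ is paid for by both the fit term and the penalty.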

Setting $\partial L / \partial \beta = 0$, the problem of minimising the cost function in Eq (4) can be decomposed into the following two subproblems: 

1) Estimating the parameters $\beta$ of the prediction function $f(x)$: 

$$\hat{\beta} = (X^T X)^{\dagger} X^T \sqrt{W}(y - \gamma). \qquad (5)$$

Mathematically, the Moore-Penrose pseudo-inverse of $X^T X$ is defined as $(X^T X)^{\dagger} = \lim_{\mu \to 0} ((X^T X)^T (X^T X) + \mu I)^{-1} (X^T X)^T$, where $I$ is the identity matrix. The scalar variable $\mu$ is introduced to avoid numerical instability [42], and typically assumes a small value (see footnote 2). With the introduction of $\mu$, Eq (5) becomes: 

$$\hat{\beta} = (X^T X + \mu I)^{-1} X^T \sqrt{W}(y - \gamma). \qquad (6)$$

A standard solver for Eq (6) has $O(|E| d^2)$ computational complexity, which is almost linear with respect to the size of the graph $|E|$ if $d \ll n$. Faster algorithms based on the Krylov iterative and algebraic multi-grid methods [43] can also be used. 

2) Outlier detection: 

$$\hat{\gamma} = \arg\min_{\gamma} \frac{1}{2}\|(I - H)\sqrt{W}(y - \gamma)\|_2^2 + \lambda\|\gamma\|_{1,w} \qquad (7)$$

$$= \arg\min_{\gamma} \frac{1}{2}\|\tilde{y} - \tilde{X}\gamma\|_2^2 + \lambda\|\gamma\|_{1,w}, \qquad (8)$$

where $H = X (X^T X)^{\dagger} X^T$ is the hat matrix, $\tilde{X} = (I - H)\sqrt{W}$ and $\tilde{y} = \tilde{X} y$. Eq (8) is obtained by plugging the solution $\hat{\beta}$ back into Eq (4). 
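The two subproblems can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, the weights are passed as the vector of $\sqrt{w_{ij}}$, and the hat matrix is formed densely with `np.linalg.pinv`, which is only practical for small graphs.

```python
import numpy as np


def solve_beta(X, W_sqrt, y, gamma, mu=1e-3):
    """Ridge-stabilised estimate of beta as in Eq (6)."""
    d = X.shape[1]
    rhs = X.T @ (W_sqrt * (y - gamma))            # X^T sqrt(W) (y - gamma)
    return np.linalg.solve(X.T @ X + mu * np.eye(d), rhs)


def reduced_problem(X, W_sqrt, y):
    """Return (X_tilde, y_tilde) of Eq (8) after eliminating beta."""
    H = X @ np.linalg.pinv(X)                     # hat matrix X (X^T X)^+ X^T
    X_tilde = (np.eye(len(y)) - H) * W_sqrt[None, :]   # (I - H) sqrt(W)
    y_tilde = X_tilde @ y
    return X_tilde, y_tilde


# Hypothetical toy design: 6 edges, 2 feature dimensions, one vote per edge.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))
W_sqrt = np.ones(6)
y = np.ones(6)
X_tilde, y_tilde = reduced_problem(X, W_sqrt, y)
print(solve_beta(X, W_sqrt, y, gamma=np.zeros(6)))
print(X_tilde.shape, y_tilde.shape)
```

The identity behind Eq (7) is that the residual left after the best $\beta$ is fitted equals $(I - H)\sqrt{W}(y - \gamma)$, so the outlier variables can be estimated without reference to $\beta$.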

3.3 Outlier detection by regularisation path 

From the formulations described above, it is clear that outlier detection by solving Eq (8) is the key: once the outliers are identified, the estimated $\hat{\gamma}$ can be used to substitute $\gamma$ in Eq (5) and the estimation of the prediction function parameter $\beta$ becomes straightforward. Now let us focus on solving Eq (8) for outlier detection. 

Note that solving Eq (8) is essentially a LASSO (Least Absolute Shrinkage and Selection Operator) [20] problem. For a LASSO problem, tuning the regularisation parameter $\lambda$ is notoriously difficult [44], [45], [46], [47]. In particular, in our URLR framework, the $\lambda$ value directly decides the ratio of outliers in the training set, which is unknown. A number of methods for determining $\lambda$ exist, but none is suitable for our formulation: 

1) Some heuristic rules for setting the value of $\lambda$, such as $\lambda = 2.5\sigma$, are popular in existing robust ranking models such as the M-estimator [?], where $\sigma$ is a Gaussian variance set manually based on human prior knowledge. However, setting a constant $\lambda$ value independent of the dataset is far from optimal because the ratio of outliers may vary for different crowdsourced datasets. 

2) Cross validation is also not applicable here because each edge $e_{ij}$ is associated with a $\gamma_{ij}$ variable and any held-out edge $e_{ij}$ also has an associated unknown variable $\gamma_{ij}$. As a result, cross validation can only optimise part of the sparse variables while leaving those for the held-out validation set undetermined. 

3) Data adaptive techniques such as Scaled LASSO [45] and Square-Root LASSO [46] typically generate over-estimates on the support set of outliers. Moreover, they rely on the homogeneous Gaussian noise assumption, which is often not valid in practice. 

4) The other alternatives, e.g. the Akaike information criterion (AIC) and the Bayesian information criterion (BIC), are often unstable in outlier detection LASSO problems [47] (see footnote 3). 

2. In this work, $\mu$ is set to 0.001. 

This inspires us to sequentially consider all available solutions for all sparse variables along the Regularisation Path (RP) by gradually decreasing the value of the regularisation parameter $\lambda$ from $\infty$ to 0. Specifically, based on the piecewise-linearity property of LASSO, a regularisation path can be efficiently computed by the R package "glmnet" [48] (see footnote 4). When $\lambda = \infty$, the regularisation parameter will strongly penalise outlier detection: if any annotation is taken as an outlier, it will greatly increase the value of the cost function in Eq (8). When $\lambda$ is changed from $\infty$ to 0, LASSO (see footnote 5) will first select the variable subset accounting for the highest deviations to the observations $\tilde{y}$ in Eq (8). These high deviations should be assigned higher priority to represent the nonzero elements (see footnote 6) of $\gamma$, because $\gamma$ compensates for the discrepancy between annotation and prediction. Based on this idea, we can order the edge set $E$ according to which nonzero $\gamma_{ij}$ appears first when $\lambda$ is decreased from $\infty$ to 0. In other words, if the outlier variable $\gamma_{ij}$ associated with an edge $e_{ij}$ becomes nonzero at a larger $\lambda$ value, that edge has a higher probability of being an outlier. Following this order, we identify the top $p\%$ edge set $\Lambda_p$ as the annotation outliers, and its complementary set $\Lambda_{1-p} = E \setminus \Lambda_p$ as the inliers. Therefore, the outcome of estimating $\gamma$ using Eq (8) is a binary outlier indicator vector $f = [f_{e_{ij}}]$: 

$$f_{e_{ij}} = \begin{cases} 1 & e_{ij} \in \Lambda_{1-p} \\ 0 & e_{ij} \in \Lambda_p \end{cases}$$

where each element $f_{e_{ij}}$ indicates whether the corresponding edge is an outlier or not. 
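The ordering of edges along the regularisation path can be sketched as below. The text computes the path with glmnet; this sketch swaps in a plain coordinate descent solver evaluated on a decreasing grid of $\lambda$ values, which only approximates the exact entry points. The function names, grid size, sweep count and the toy data are all assumptions made for illustration.

```python
import numpy as np


def soft_threshold(rho, t):
    """Soft-thresholding operator used by LASSO coordinate descent."""
    return np.sign(rho) * max(abs(rho) - t, 0.0)


def rank_edges_by_path(X_tilde, y_tilde, w, n_lambdas=60, n_sweeps=100, tol=1e-8):
    """Order edges by how early gamma_k becomes nonzero as lambda decreases."""
    m = X_tilde.shape[1]
    col_sq = np.sum(X_tilde ** 2, axis=0)
    lam_max = np.max(np.abs(X_tilde.T @ y_tilde) / w)   # gamma = 0 above this
    lambdas = lam_max * np.logspace(0, -3, n_lambdas)   # decreasing grid
    gamma = np.zeros(m)
    entry = np.full(m, n_lambdas)       # grid index of first nonzero gamma_k
    for step, lam in enumerate(lambdas):
        for _ in range(n_sweeps):       # warm-started coordinate descent
            for k in range(m):
                if col_sq[k] < 1e-12:   # column carries no information
                    continue
                r = y_tilde - X_tilde @ gamma + X_tilde[:, k] * gamma[k]
                rho = X_tilde[:, k] @ r
                gamma[k] = soft_threshold(rho, lam * w[k]) / col_sq[k]
        newly = (np.abs(gamma) > tol) & (entry == n_lambdas)
        entry[newly] = step
    return np.argsort(entry, kind="stable"), entry


def outlier_indicator(order, p):
    """f = 0 for the top fraction p of the ordered edges, 1 for the rest."""
    f = np.ones(len(order))
    f[order[: int(round(p * len(order)))]] = 0.0
    return f


# Hypothetical data: 1-D features 0..4, six consistent edges, one flipped edge.
phi = np.arange(5.0).reshape(5, 1)
edges = [(1, 0), (2, 1), (3, 2), (4, 3), (2, 0), (4, 2), (0, 4)]
C = np.zeros((len(edges), 5))
for row, (i, j) in enumerate(edges):
    C[row, i], C[row, j] = 1.0, -1.0
w = np.ones(len(edges))
X = C @ phi
X_tilde = np.eye(len(edges)) - X @ np.linalg.pinv(X)    # sqrt(W) = I here
y_tilde = X_tilde @ np.ones(len(edges))
order, entry = rank_edges_by_path(X_tilde, y_tilde, w)
print(order[0], outlier_indicator(order, p=1 / 7))
```

In this toy graph the flipped edge `(0, 4)` has the largest residual under the consensus fit, so its outlier variable is the first to leave zero and it heads the ordering.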

Now with the outlier indicator vector $f$ estimated using the regularisation path, instead of estimating $\beta$ by substituting $\gamma$ in Eq (5) with an estimated $\hat{\gamma}$, $\beta$ can be computed as 

$$\hat{\beta} = (X^T F X + \mu I)^{-1} X^T \sqrt{W} F y, \qquad (9)$$

where $F = \mathrm{diag}(f)$, that is, we use $f$ to 'clean up' $y$ before estimating $\beta$. 

The pseudo-code of learning our URLR model is summarised in Algorithm 1. 

3. We found empirically that the model automatically selected by BIC or AIC failed to detect any meaningful outliers in our experiments. For details of the experiments and a discussion on the issue of determining the outlier ratio, please visit the project webpage at http://www.eecs.qmul.ac.uk/~yf300/ranking/index.html 

4. http://cran.r-project.org/web/packages/glmnet/glmnet.pdf 

5. For a thorough discussion from a statistical perspective, please see [?], [?], [?], [?]. 

6. This is related to LASSO for covariate selection in a graph. Please see [52] for more details. 

Algorithm 1 Learning a unified robust learning to rank (URLR) model for SVP prediction 

Input: A training dataset consisting of the feature matrix $\Phi$ and the pairwise annotation graph $G$, and an outlier pruning rate $p\%$. 

Output: Detected outliers $f$ and prediction model parameter $\beta$. 

1) Solve Eq (8) using Regularisation Path; 

2) Take the top $p\%$ pairs as outliers to obtain the outlier indicator vector $f$; 

3) Compute $\beta$ using Eq (9). 
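The final step of Algorithm 1, Eq (9), can be sketched as follows once an indicator vector $f$ is available. The function name `fit_beta_on_inliers` and the toy data are assumptions for illustration; here $f$ is set by hand rather than produced by a regularisation path.

```python
import numpy as np


def fit_beta_on_inliers(Phi, C, w, y, f, mu=1e-3):
    """Eq (9): beta = (X^T F X + mu I)^{-1} X^T sqrt(W) F y with F = diag(f)."""
    W_sqrt = np.sqrt(w)
    X = W_sqrt[:, None] * (C @ Phi)             # X = sqrt(W) C Phi
    XtF = X.T * f[None, :]                      # X^T F: zero out outlier edges
    A = XtF @ X + mu * np.eye(Phi.shape[1])
    b = XtF @ (W_sqrt * y)                      # X^T sqrt(W) F y
    return np.linalg.solve(A, b)


# Hypothetical data: 1-D features 0..4, six consistent edges, one flipped edge.
Phi = np.arange(5.0).reshape(5, 1)
edges = [(1, 0), (2, 1), (3, 2), (4, 3), (2, 0), (4, 2), (0, 4)]
C = np.zeros((len(edges), 5))
for row, (i, j) in enumerate(edges):
    C[row, i], C[row, j] = 1.0, -1.0
w = np.ones(len(edges))
y = np.ones(len(edges))

f_clean = np.array([1, 1, 1, 1, 1, 1, 0], dtype=float)   # last edge pruned
beta_clean = fit_beta_on_inliers(Phi, C, w, y, f_clean)
beta_all = fit_beta_on_inliers(Phi, C, w, y, np.ones(len(edges)))
print(beta_clean, beta_all)
print((Phi @ beta_clean).ravel())               # predicted SVP scores f(x)
```

Pruning the flipped edge gives a markedly larger coefficient than fitting on all edges, because the contradictory comparison no longer pulls the ranking function towards zero.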

3.4 Discussions 

3.4.1 Advantage over majority voting 
The proposed URLR framework identifies outliers globally by integrating all local pairwise comparisons together, in contrast to the local aggregation based majority voting. Figure 2(a) illustrates why our URLR framework is advantageous over the local majority voting method for outlier detection. Assume there are five images A to E with five pairs of them compared three times each, and the correct global ranking order of these 5 images in terms of a specific SVP is $A < B < C < D < E$. Figure 2(a) shows that among the five compared pairs, majority voting can successfully identify four outlier cases: $A > B$, $B > C$, $C > D$, and $D > E$, but not the fifth one $E < A$. However, when considered globally, it is clear that $E < A$ is an outlier because if we have $A < B < C < D < E$, we can deduce $A < E$. Our formulation can detect this tricky outlier. More specifically, if the estimated $\beta$ makes $\beta^T\phi_A - \beta^T\phi_E > 0$, it has a small local inconsistency cost for that minority vote edge $A \to E$. However, such a $\beta$ value will be 'propagated' to other images by using the voting edges $B \to A$, $C \to B$, $D \to C$, and $E \to D$, which are accumulated into a much bigger global inconsistency with the annotation. This enables our model to detect $E \to A$ as an outlier, contrary to the majority voting decision. In particular, majority voting will introduce a loop comparison $A < B < C < D < E < A$, which is the well-known Condorcet's paradox [33], [19]. 

We further give two more extreme cases in Figures 2(b) and (c). Due to Condorcet's paradox, in Figure 2(b) the estimated $\beta$ from majority voting, which removes A → E, is even worse than that from all annotation pairs, which at least retain the correct annotation A → E. Furthermore, Figure 2(c) shows that when each pair only receives votes in one direction, majority voting will cease to work altogether, but our URLR can still detect outliers by examining the global cost. This example thus highlights the capability of URLR in coping with extremely
Figure 2. Better outlier detection can be achieved using our URLR framework than majority voting. Green arrows/edges indicate correct annotations, while red arrows are outliers. The numbers indicate the number of votes received by each edge.


sparse pairwise comparison labels. In our experiments (see Section 4), the advantage of URLR over majority voting is validated on various SVP prediction problems.
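The loop introduced by majority voting can be reproduced with a small script. The sketch below is illustrative only: the 2-to-1 vote split per pair is an assumed stand-in for the vote counts of Figure 2(a), each vote is written as a hypothetical `(lower, higher)` tuple, and the helper names are not from the paper. It keeps the locally winning direction for every pair and then checks whether the resulting directed graph admits any global order.

```python
from collections import defaultdict

def majority_edges(votes):
    """votes: list of (lower, higher) tuples, one per crowd vote.
    Returns the direction that wins the local vote for each compared pair."""
    tally = defaultdict(int)
    for lower, higher in votes:
        tally[(lower, higher)] += 1
    kept = []
    for (lower, higher), n in tally.items():
        if n > tally.get((higher, lower), 0):
            kept.append((lower, higher))
    return kept

def has_cycle(edges):
    """True if the directed graph contains a cycle (Kahn's algorithm)."""
    nodes = {n for e in edges for n in e}
    indeg = {n: 0 for n in nodes}
    for _, hi in edges:
        indeg[hi] += 1
    queue = [n for n in nodes if indeg[n] == 0]
    seen = 0
    while queue:
        n = queue.pop()
        seen += 1
        for lo, hi in edges:
            if lo == n:
                indeg[hi] -= 1
                if indeg[hi] == 0:
                    queue.append(hi)
    return seen < len(nodes)

# Assumed votes: adjacent pairs are 2 correct vs 1 wrong; the pair (A, E)
# receives 1 correct vote (A < E) and 2 outlier votes (E < A).
votes = []
for lo, hi in [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")]:
    votes += [(lo, hi), (lo, hi), (hi, lo)]
votes += [("A", "E"), ("E", "A"), ("E", "A")]

local_winners = majority_edges(votes)
```

With these assumed votes, `has_cycle(local_winners)` is true: every pair is resolved locally, yet no global ranking is consistent with all five majority decisions, which is why a global inconsistency cost is needed to single out the E < A votes.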


3.4.2 Advantage over robust statistical ranking 
Our framework is closely related to Huber's theory of robust regression [44], which has been used for robust statistical ranking [53]. In contrast to learning to rank, robust statistical ranking is only concerned with ranking a set of training instances by integrating their (noisy) pairwise rankings. No low-level feature representation of the instances is used, as robust ranking does not aim to learn a ranking prediction function that can be applied to unseen test data. To see the connection between URLR and robust ranking, consider the Huber M-estimator [44], which aims to estimate the optimal global ranking for a set of training instances by minimising the following cost function:

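$$\min_{\theta} \sum_{e_{ij} \in E} w_{ij}\, \rho_\lambda\big((\theta_i - \theta_j) - y_{ij}\big) \qquad (10)$$

where $\theta = [\theta_i] \in \mathbb{R}^{|V|}$ is the ranking score vector storing the global ranking score of each training instance $i$. The Huber's loss function $\rho_\lambda(x)$ is defined as

$$\rho_\lambda(x) = \begin{cases} x^2/2, & \text{if } |x| \le \lambda \\ \lambda|x| - \lambda^2/2, & \text{if } |x| > \lambda. \end{cases} \qquad (11)$$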


Using this loss function, when $|(\theta_i - \theta_j) - y_{ij}| \le \lambda$, the comparison is taken as a "good" one and penalised by an $\ell_2$-loss for Gaussian noise. Otherwise, it is regarded as a sparse outlier and penalised by an $\ell_1$-loss. It can be shown [53] that robust ranking with Huber's loss is equivalent to a LASSO problem, which can be applied to joint robust ranking and outlier detection [42]. Specifically, the global ranking of the training instances and the outliers in the pairwise rankings can be estimated as



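$$\{\hat{\theta}, \hat{\gamma}\} = \arg\min_{\theta, \gamma} \frac{1}{2}\|y - C\theta - \gamma\|_{2,W}^2 + \lambda\|\gamma\|_{1,W} \qquad (12)$$

$$= \arg\min_{\theta, \gamma} \sum_{e_{ij} \in E} w_{ij}\left[\frac{1}{2}\big(y_{ij} - \gamma_{ij} - (\theta_i - \theta_j)\big)^2 + \lambda|\gamma_{ij}|\right] \qquad (13)$$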


The optimisation problem (12) is designed for solving the robust ranking problem with Huber's loss function, hence called Huber-LASSO [53].
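The equivalence between Huber's loss and the LASSO form can be checked numerically for a single edge. The sketch below is illustrative and the function names are hypothetical: for a residual $r$ (the label minus the score difference), minimising $\frac{1}{2}(r - \gamma)^2 + \lambda|\gamma|$ over the outlier variable $\gamma$ has the closed-form soft-thresholding solution, and the minimum value equals $\rho_\lambda(r)$.

```python
def huber(x, lam):
    """Huber's loss: quadratic for |x| <= lam, linear beyond."""
    ax = abs(x)
    return 0.5 * x * x if ax <= lam else lam * ax - 0.5 * lam * lam

def soft_threshold(r, lam):
    """Closed-form minimiser of 0.5 * (r - g)^2 + lam * |g| over g."""
    if r > lam:
        return r - lam
    if r < -lam:
        return r + lam
    return 0.0

def lasso_edge_cost(r, lam):
    """Minimum of the per-edge LASSO cost and the minimising outlier value."""
    g = soft_threshold(r, lam)
    return 0.5 * (r - g) ** 2 + lam * abs(g), g
```

A residual inside $[-\lambda, \lambda]$ gives $\gamma = 0$ (the edge is treated as an inlier), while a larger residual gives a non-zero $\gamma$, which is what makes the non-zero entries of $\hat{\gamma}$ usable as outlier indicators.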


Our URLR can be considered as a generalisation of the Huber-LASSO based robust ranking problem above. Comparing Eq (12) with Eq (3), it can be seen that the main difference between URLR and conventional robust ranking is that in URLR the cost function has the low-level feature matrix $\Phi$ computed from the training instances, and the prediction function parameter $\beta$, such that $\theta = \Phi\beta$. This is because the objective of URLR is to predict SVP for unseen test data. However, URLR and robust ranking do share one thing in common: the ability to detect outliers in the training data based on a Huber-LASSO formulation. This means that, as opposed to our unified framework with feature $\Phi$, one could design a two-step approach for learning to rank by first identifying and removing outliers using Eq (12), followed by introducing the low-level feature matrix $\Phi$ and prediction model parameter $\beta$ and estimating $\beta$ using Eq (9). We call this approach Huber-LASSO-FL based learning to rank, which differs from URLR mainly in that outliers are detected without considering low-level features.

Next we show that there is a critical theoretical advantage of URLR over conventional Huber-LASSO in detecting outliers from the training instances. This is due to the difference in the projection space for estimating $\gamma$, which is denoted as $\Gamma$. To explain this point, we decompose $X$ by Singular Value Decomposition (SVD),

$$X = U \Sigma V^T \qquad (14)$$

where $U = [U_1, U_2]$ with $U_1$ being an orthogonal basis of the column space of $X$ and $U_2$ an orthogonal basis of its complement. Therefore, due to the orthogonality $U^T U = I$ and $U_2^T X = 0$, we can simplify the estimation of $\gamma$ into

$$\hat{\gamma} = \arg\min_{\gamma} \|U_2^T y - U_2^T \gamma\|_2^2 + \lambda\|\gamma\|_1 \qquad (15)$$

The SVD orthogonally projects $y$ onto the column space of $X$ and its complement, while $U_1$ is an orthogonal basis of the column space of $X$ and $U_2$ is the orthogonal basis of its complement $\Gamma$ (i.e. the kernel space of $X^T$). With the SVD, we can now compute the outliers $\hat{\gamma}$ by solving Eq (15), which again is a LASSO problem [42], where outliers provide sparse approximations of the projection $U_2^T y$. We can thus compare dimensions of the projection spaces of URLR and Huber-LASSO-FL:

• Robust ranking based on the featureless Huber-LASSO-FL⁷: to see the dimension of the projection space $\Gamma$, i.e. the space of cyclic rankings [19], [53], we can perform a similar SVD operation and rewrite Eq (12) in the same form as Eq (15), but this time we have $X = \sqrt{W}C$, $U_1 \in \mathbb{R}^{|E| \times (|V|-1)}$ and $U_2 \in \mathbb{R}^{|E| \times (|E|-|V|+1)}$. So the dimension of $\Gamma$ for Huber-LASSO-FL is $\dim(\Gamma) = |E| - |V| + 1$.

• URLR: in contrast we have $X = \sqrt{W}C\Phi$, $U_1 \in \mathbb{R}^{|E| \times d}$ and $U_2 \in \mathbb{R}^{|E| \times (|E|-d)}$. So the dimension of $\Gamma$ for URLR is $\dim(\Gamma) = |E| - d$.

7. We assume that the graph is connected, that is, $|E| \ge |V| - 1$; we thus have $\mathrm{rank}(C) = |V| - 1$.













From the above analysis we can see that given a very sparse graph with $|E| \approx |V|$, the projection space $\Gamma$ for Huber-LASSO-FL will have a dimension ($|E| - |V| + 1$) too small to be effective for detecting outliers. In contrast, by exploiting a low dimensional ($d \ll |V|$) feature representation of the original node space, URLR can enlarge the projection space to one of dimension $|E| - d$. As a result, URLR can better identify outliers, especially for sparse pairwise annotation graphs. In general, this advantage exists when the feature dimension $d$ is smaller than the number of training instances $|V| = N$, and the smaller the value of $d$, the bigger the advantage over Huber-LASSO. In practice, given a large training set we typically have $d \ll |V|$. On the other hand, when the number of instances is small and each instance is represented by a high-dimensional feature vector, we can always reduce the feature dimension using techniques such as PCA to make sure that $d \ll |V|$. This theoretical advantage of URLR over conventional Huber-LASSO in outlier detection is validated experimentally in Section 4.
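The two dimension counts can be verified numerically on a small synthetic graph. The following sketch is illustrative only: the graph, the random feature matrix and the helper names are all assumptions, and unit edge weights ($W = I$) are used. It computes $\dim(\Gamma)$ as the number of edges minus the rank of $X$, once with $X = C$ and once with $X = C\Phi$.

```python
import numpy as np

def incidence_matrix(edges, n_nodes):
    """Row e has +1 at node i and -1 at node j for edge (i, j)."""
    C = np.zeros((len(edges), n_nodes))
    for e, (i, j) in enumerate(edges):
        C[e, i], C[e, j] = 1.0, -1.0
    return C

def projection_dims(edges, Phi):
    """Return (dim for featureless Huber-LASSO, dim for URLR), assuming W = I."""
    C = incidence_matrix(edges, Phi.shape[0])
    n_edges = len(edges)
    dim_featureless = n_edges - np.linalg.matrix_rank(C)
    dim_urlr = n_edges - np.linalg.matrix_rank(C @ Phi)
    return int(dim_featureless), int(dim_urlr)

# Hypothetical sparse graph: 30 nodes, a chain (keeps it connected) plus 11 extra edges.
n_nodes, d = 30, 5
edges = [(k + 1, k) for k in range(n_nodes - 1)] + [(k + 2, k) for k in range(11)]
Phi = np.random.default_rng(0).standard_normal((n_nodes, d))  # random 5-D features
dims = projection_dims(edges, Phi)
```

Here $|E| = 40$, so the featureless projection space has dimension $40 - 30 + 1 = 11$, whereas the feature-based one has dimension $40 - 5 = 35$. The same arithmetic applied to the age experiment reported later ($|E| = 600$, $|V| = 300$) gives $600 - 300 + 1 = 301$ for Huber-LASSO-FL.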

3.4.3 Regularisation on $\beta$

It is worth mentioning that in the cost function of URLR (Eq (3)), there are two sets of variables to be estimated, $\gamma$ and $\beta$, but only one $\ell_1$ regularisation term on $\gamma$ to enforce sparsity. When the dimensionality of $\beta$ (i.e. $d$) is high, one would expect to see an $\ell_2$ regularisation term on $\beta$ (e.g. ridge regression), because the coefficients of highly correlated low-level features can be poorly estimated and exhibit high variance without imposing a proper size constraint on the coefficients [42]. The reason we do not include such a regularisation term is that, as mentioned above, using URLR we need to make sure the low-level feature space dimensionality $d$ is low, which means that the dimensionality of $\beta$ is also low, making a regularisation term on $\beta$ redundant. This leads to the applicability of much simpler solvers, and we show empirically in the next section that satisfactory results can be obtained with this simplification.

4 Experiments 

Experiments were carried out on five benchmark datasets (see Table 1) which fall into three categories: (1) experiments on estimating subjective visual properties (SVPs) that are useful on their own, including image (Section 4.1) and video interestingness (Section 4.2), (2) experiments on estimating SVPs as relative attributes for visual recognition (Section 4.3), and (3) experiments on human age estimation from face images (Section 4.4). The third set of experiments can be considered as synthetic experiments: human age is not a subjective visual property, although it is ambiguous and poses a problem even for humans [56]. However, as ground truth is available, this set of experiments is designed to gain insights into how different SVP prediction models work.


4.1 Image interestingness prediction 


Datasets The image interestingness dataset was first introduced in [24] for studying memorability. It was later re-annotated as an image interestingness dataset by [4]. It consists of 2222 images. Each was represented as a 915 dimensional attribute⁸ feature vector [24], [4] such as central object, unusual scene and so on. 16000 pairwise comparisons were collected by [4] using AMT and used as annotation. On average, each image is viewed and compared with 11.9 other images, resulting in a total of 16000 pairwise labels⁹.

Settings 1000 images were randomly selected for training and the remaining 1222 for testing. All the experiments were repeated 10 times with different random training/test splits to reduce variance. The pruning rate $p$ was set to 20%. We also varied the number of annotated pairs used to test how well each compared method copes with increasing annotation sparsity.

Evaluation metrics For both image and video interestingness prediction, Kendall tau rank distance was employed to measure the percentage of pairwise mismatches between the predicted ranking order for each pair of test data, obtained from their prediction/ranking function scores, and the ground truth ranking provided by [4] and [15] respectively. A larger Kendall tau rank distance means a lower quality of the predicted ranking order.
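As an illustration of the metric, the sketch below computes the fraction of mismatched pairs between two score lists. It is a minimal version under stated assumptions: the ground truth is given here as one score per test item (whereas the datasets supply pairwise ground truth), pairs tied in the ground truth are skipped, and a tie in the prediction counts as a mismatch. The function name is hypothetical.

```python
from itertools import combinations

def kendall_tau_distance(pred_scores, true_scores):
    """Fraction of item pairs whose predicted order disagrees with the ground truth."""
    compared = 0
    mismatched = 0
    for i, j in combinations(range(len(true_scores)), 2):
        t = true_scores[i] - true_scores[j]
        if t == 0:
            continue  # no ground-truth preference for this pair
        compared += 1
        p = pred_scores[i] - pred_scores[j]
        if p * t <= 0:
            mismatched += 1
    return mismatched / compared if compared else 0.0
```

A perfect prediction gives 0, a fully reversed ranking gives 1, and swapping a single adjacent pair among three items gives 1/3.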
Competitors We compare our method (URLR) with 
four competitors. 


1) Maj-Vot-1 [5]: this method uses majority voting for outlier pruning and rankSVM for learning to rank.

2) Maj-Vot-2 [4]: this method also first removes outliers by majority voting. After that, the fraction of selections by the pairwise comparisons for each data point is used as an absolute interestingness score and a regression model is then learned for prediction. Note that Maj-Vot-2 was only compared in the experiments on image and video interestingness prediction, since only these two datasets have sufficiently dense annotations for Maj-Vot-2.

3) Huber-LASSO-FL: robust statistical ranking that performs outlier detection using the conventional featureless Huber-LASSO as described in Section 3.4.2, followed by estimating $\beta$ using Eq (9).

4) Raw: our URLR model without outlier detection, that is, all annotations are used to estimate $\beta$.
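The scoring step of Maj-Vot-2 can be sketched as follows. This is one possible reading of the description above, not the implementation of [4]: votes are hypothetical `(winner, loser)` tuples, tied pairs are dropped, and the fraction of selections is computed over the pairs that survive majority voting. The resulting per-item scores would then serve as regression targets.

```python
from collections import Counter

def majority_vote(votes):
    """votes: list of (winner, loser) tuples. Keep the locally winning direction
    for each compared pair; pairs with a tied vote are dropped."""
    tally = Counter(votes)
    kept = []
    for (winner, loser), n in tally.items():
        if n > tally.get((loser, winner), 0):
            kept.append((winner, loser))
    return kept

def selection_fraction(pairs):
    """Fraction of comparisons in which each item was selected as the winner."""
    wins, total = Counter(), Counter()
    for winner, loser in pairs:
        wins[winner] += 1
        total[winner] += 1
        total[loser] += 1
    return {item: wins[item] / total[item] for item in total}

votes = ([("a", "b")] * 3 + [("b", "a")] + [("b", "c")] * 2
         + [("c", "a")] + [("a", "c")] * 2)
scores = selection_fraction(majority_vote(votes))
```

With only two comparisons per item, as in this toy example, the scores can take very few distinct values, which illustrates why this approach needs densely annotated data.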


Comparative results The interestingness prediction performance of the various models is evaluated while varying the amount of pairwise annotation used. The results are shown in Figure 3 (left). It shows clearly that our URLR significantly outperforms the four alternatives for a wide range of annotation density. This validates the effectiveness of our method. In particular, it can

8. We delete 8 attribute features from the original feature vector in [24], [4] such as "attractive" because they are highly correlated with image interestingness.

9. On average, for each labelled pair, around 80% of the annotations 
agree with one ranking order and 20% the other. 
Dataset                        No. pairs   No. img/video   Feature Dim.   No. classes
Image Interestingness [24]     16000       2222            932 (150)      1
Video Interestingness [15]     60000       420             1000 (60)      14
PubFig [54], [2]               2616        772             557 (100)      8
Scene [55]                     1378        2688            512 (100)      8
FG-Net Face Age Dataset [56]   -           1002            55             -

Table 1
Dataset summary. We use the original features to learn the ranking model (Eq (9)) and reduce the feature dimension (values in brackets) using Kernel PCA [57] to improve outlier detection (Eq (8)) by enlarging the projection space of $\gamma$.



Figure 3. Image interestingness prediction comparative evaluation. Smaller Kendall tau distance means better 
performance. The mean and standard deviation of each method over 10 trials are shown in the plots. 
Figure 4. Qualitative examples of outliers detected by URLR. In each box, there are two images. The left image was annotated as more interesting than the right. Success cases (green boxes) show true positive outliers detected by URLR (i.e. right images are more interesting according to the ground truth). Two failure cases are shown in red boxes (URLR thinks the images on the right are more interesting but the ground truth agrees with the annotation).


be observed that: (1) The improvement over Maj-Vot-1 [5] and Maj-Vot-2 [4] demonstrates the superior outlier detection ability of URLR due to global rather than local outlier detection. (2) URLR is superior to Huber-LASSO-FL because the joint outlier detection and ranking prediction framework of URLR enables the enlargement of the projection space $\Gamma$ for $\gamma$ (see Section 3.4.2), resulting in better outlier detection performance. (3) The performance of Maj-Vot-2 [4] is the worst among all methods compared, particularly so given sparser annotation. This is not surprising: in order to get a reliable absolute interestingness value, dozens or even hundreds of comparisons per image are required, a condition not met by this dataset. (4) The performance of Huber-LASSO-FL is also better than Maj-Vot-1 and Maj-Vot-2, suggesting that even a weaker global outlier detection approach is better than the majority voting based local one. (5) Interestingly, even the baseline method Raw gives a comparable result to Maj-Vot-1 and Maj-Vot-2, which suggests that just using all annotations without discrimination in a global cost function is as effective as majority voting¹⁰.

Figure 3 (right) evaluates how the performance of URLR and Huber-LASSO-FL is affected by the pruning rate $p$. It can be seen that the performance of URLR improves with an increasing pruning rate. This means that our URLR can keep on detecting true positive outliers. The gap between URLR and Huber-LASSO-FL gets bigger when more comparisons are pruned, showing that Huber-LASSO-FL stops detecting outliers much earlier on. However, when the pruning rate is over 55%, since most outliers have been removed, inliers start to be pruned, leading to poorer performance.

Qualitative Results Some examples of outlier detection using URLR are shown in Figure 4. It can be seen that those in the green boxes are clearly outliers and

10. One intuitive explanation for this is that given a pair of data with 
multiple contradictory votes, using Raw, both the correct and incorrect 
votes contribute to the learned model. In contrast, with Maj-Vot, one 
of them is eliminated, effectively amplifying the other's contribution 
in comparison to Raw. When the ratio of outliers gets higher, Maj-Vot 
will make more mistakes in eliminating the correct votes. As a result, 
its performance drops to that of Raw, and eventually falls below it. 
are detected correctly by our URLR. The failure cases are interesting. For example, in the bottom case, ground truth indicates that the woman sitting on a bench is more interesting than the nice beach image, whilst our URLR predicts otherwise. The odd facial appearance of that woman or the fact that she is holding a camera could be the reason why this image is considered to be more interesting than the otherwise more visually appealing beach image. However, it is unlikely that the features used by URLR are powerful enough to describe such fine appearance details.

4.2 Video interestingness prediction 

Datasets The video interestingness dataset is the YouTube interestingness dataset introduced in [15]. It contains 14 categories of advertisement videos (e.g. 'food' and 'digital products'), each of which has 30 videos. 10 to 15 annotators were asked to give complete interestingness comparisons for all the videos in each category. So the original annotations are noisy but not sparse. We used bag-of-words of Scale Invariant Feature Transform (SIFT) and Mel-Frequency Cepstral Coefficient (MFCC) as the feature representation, which were shown to be effective in [15] for predicting video interestingness.

Experimental settings Because comparing videos across different categories is not very meaningful, we followed the same settings as in [15] and only compared the interestingness of videos within the same category. Specifically, from each category we used 20 videos and their paired comparisons for training and the remaining 10 videos for testing. The experiments were repeated for 10 rounds and the averaged results are reported.

Since MFCC and SIFT are bag-of-words features, we employed the $\chi^2$ kernel to compute and combine the features. To facilitate the computation, the $\chi^2$ kernel is approximated by an additive kernel with explicit feature mapping [58]. To make the results on this dataset more comparable to those in [15], we used the rankSVM model to replace Eq (9) as the ranking model. As in the image interestingness experiments, we used Kendall tau rank distance as the evaluation metric, and we find that the same results can be obtained if the prediction accuracy in [15] is used. The pruning rate was again set to 20%.

Comparative Results Figure 5(a) compares the interestingness prediction methods given varying amounts of annotation, and Figure 5(b) shows the per category performance. The results show that all the observations we had for the image interestingness prediction experiment still hold here, and across all categories. However, in general the gaps between our URLR and the alternatives are smaller as this dataset is densely annotated. In particular, the performance of Huber-LASSO-FL is much closer to our URLR now. This is because the advantage of URLR over Huber-LASSO-FL is stronger when $|E|$ is close to $|V|$. In this experiment, $|E|$ (1000s) is much greater than $|V|$ (20) and the advantage of enlarging the projection space $\Gamma$ for $\gamma$ (see Section 3.4.2) diminishes.

Qualitative Results Some outlier detection examples are shown in Figure 6. In the two successful detection examples, the bottom videos are clearly more interesting than the top ones, because they (1) have a plot, sometimes with a twist, and (2) are accompanied by popular songs in the background and/or conversations. Note that in both cases, majority voting would consider them inliers. The failure case is a hard one: both videos have cartoon characters, some plot, some conversation, and similar music in the background. This thus corresponds to a truly ambiguous case which can go either way.

4.3 Relative attributes prediction 

Datasets The PubFig [54] and Scene [55] datasets are two relative attribute datasets. PubFig contains 772 images from 8 people and 11 attributes ('smiling', 'round face', etc.). Scene [55] consists of 2688 images from 8 categories and 6 attributes ('openness', 'natural' etc.). Pairwise attribute annotation was collected by Amazon Mechanical Turk [2]. Each pair was labelled by 5 workers and majority vote was used in [2] to average the comparisons for each pair¹¹. A total of 241 and 240 training images for PubFig and Scene respectively were labelled (i.e. compared with at least another image). The average number of compared pairs per attribute were 418 and 426 respectively, meaning most images were only compared with one or two other images. The annotations for both datasets were thus extremely sparse. GIST and colour histogram features were used for PubFig, and GIST alone for Scene. Each image also belongs to a class (different celebrities or scene types). These datasets were designed for classification, with the predicted relative attribute scores used as image representation.
Experimental Settings We evaluated two different im¬ 
age classification tasks: multi-class classification where 
samples from all classes were available for training and 
zero-shot transfer learning where one class was held out 
during training (a different class was used in each trial 
with the result averaged). Our experiment setting was 
similar to that in m, except that image-level, rather than 
class-level pairwise comparisons were used. Two settings 
were used with different amounts of annotation noise: 

• Orig: This was the original setting with the pairwise annotations used as they were.

• Orig+synth: By visual inspection, there were limited annotation outliers in these datasets, perhaps because these relative attributes are less subjective compared to interestingness. To simulate more challenging situations, we added 150 random comparisons for each attribute, many of which would correspond to outliers. This will lead to around 20% extra outliers.

The pruning rate was set to 7% for the original datasets (Orig) and 27% for the dataset with additional outliers inserted for all attributes of both datasets (Orig+synth).

Evaluation metrics For Scene and PubFig datasets, relative attributes were very sparsely collected and their

11. Thanks to the authors of [2], we have all the raw pairs data before majority voting.







Figure 5. Video interestingness prediction comparative evaluation. 



Figure 6. Qualitative examples of video interestingness outlier detection. For each pair, the top video was annotated 
as more interesting than the bottom. Green boxes indicate the annotations are correctly detected as outliers by our 
URLR and red box indicates a failure case (false positive). All 6 videos are from the ‘food’ category. 


prediction performance is thus evaluated indirectly by 
image classification accuracy with the predicted relative 
attributes as image representation. Note that for image 
classification there is ground truth and its accuracy is 
clearly dependent on the relative attribute prediction 
accuracy. For both datasets, we employed the method 
in HI to compute the image classification accuracy. 

Comparative Results Without the ground truth of relative attribute values, different models were evaluated indirectly via image classification accuracy in Figure 7. The following observations can be made: (1) Our URLR always outperforms Huber-LASSO-FL, Maj-Vot-1 and Raw for all experiment settings. The improvement is more significant when the data contain more errors (Orig+synth). (2) The performance of other methods is in general consistent with what we observed in the image and video interestingness experiments: Huber-LASSO-FL is better than Maj-Vot-1 and Raw often gives better results than majority voting. (3) For PubFig, Maj-Vot-1 [5] is better than Raw given more outliers, but it is not the case for Scene. This is probably because the annotators were more familiar with the celebrity faces in PubFig and hence their attributes than those in Scene. Consequently there should be more subjective/intentional errors for Scene, causing majority voting to choose wrong local ranking orders (e.g. some people are unsure how to compare the relative values of the 'diagonal plane' attribute for two images). These majority voting + outlier cases can only be rectified by using a global approach such as our URLR, and Huber-LASSO-FL to a lesser extent.

Qualitative Results Figure 8 gives some examples of the pruned pairs for both datasets using URLR. In the success cases, the left images were (incorrectly) annotated to have more of the attribute than the right ones. However, they are either wrong or too ambiguous to give consistent answers, and as such are detrimental to learning to rank. A number of failure cases (false positive pairs identified by URLR) are also shown. Some of them are caused by a unique view point (e.g. Hugh Laurie's mouth is not visible, so it is hard to tell who smiles more; the building and the street scene are too zoomed in compared to most other samples); others are caused by the weak feature representation, e.g. in the 'male' attribute example, the colour and GIST features are not discriminative enough for judging which of the two men has more of the 'male' attribute.

Running Cost Our algorithm is very efficient with a unified framework where all outliers are pruned simultaneously and the ranking function estimation has a closed form solution. Using URLR on PubFig, it took only 1 minute to prune 240 images with 10722 comparisons and learn the ranking function for attribute prediction on a PC with four 3.3GHz CPU cores and 8GB memory.

Figure 7. Relative attribute performance evaluated indirectly as image classification rate (chance = 0.125).

Figure 8. Qualitative results on image relative attribute prediction.

4.4 Human age prediction from face images 

In this experiment, we consider age as a subjective visual property of a face. This is partially true: for many people, given a face image, predicting the person's age can be subjective. The key difference between this and the other SVPs evaluated so far is that we do have the ground truth, i.e. the person's age when the picture was taken. This enables us to perform in-depth evaluation of the significance of our URLR framework over the alternatives on various factors such as annotation sparsity and outlier ratio (we now know the exact ratio). Outlier detection accuracy can also now be measured directly.

Dataset The FG-NET image age dataset¹² was employed, which contains 1002 images of 82 individuals labelled with ground truth ages ranging from 0 to 69. The training set is composed of the images of 41 randomly selected people and the rest are used as the test set. All experiments were repeated 10 times with different training/testing splits to reduce variability. Each image was represented by a 55-dimensional vector extracted by active appearance models (AAM) [56].

12. http://www.fgnet.rsunit.com/ 


Crowdsourcing errors We used the ground truth age to generate the pairwise comparisons without any error. Errors were then synthesised according to human error patterns estimated from data collected in an online pilot study¹³. 4000 pairwise image comparisons from 20 willingly participating "good" workers were collected as unintentional errors. We thus assume they are not contributing random or malicious annotations, so the errors of these pairwise comparisons come from the natural data ambiguity. The human unintentional age error pattern was built by fitting the error rate against the true age difference between collected pairs. As expected, humans are more error-prone for smaller age differences. Specifically, we fit a quadratic polynomial function to model the relation between the age difference of two samples and the chance of making an unintentional error. We then used this error pattern to generate unintentional errors. Intentional errors were introduced by 'bad' workers who provided random pairwise labels. This was easily simulated by adding random comparisons. In practice, human errors in crowdsourcing experiments can be a mixture of both types. Thus two settings were considered: Unint.: errors were generated following the estimated human unintentional error model, resulting in around 10% errors. Unint.+Int.: random comparisons were added on top of Unint., giving an error ratio of around 25%, unless otherwise stated. Since the ground-truth age of each face
13. http://www.eecs.qmul.ac.uk/~yf300/survey4/ 
Figure 9. Comparing URLR and Huber-LASSO-FL on ranking prediction under two error settings. Note that the ranking prediction accuracy is measured using Kendall tau rank correlation which is very similar to Kendall tau distance (see [59]). With rank correlation, the higher the value the better the performance.

Figure 10. Comparing URLR and Huber-LASSO-FL against majority voting (5 comparisons per pair).


image is known to us, we can give an upper bound for 
all the compared methods by using ground-truth age of 
training data to generate a set of pairwise comparisons. 
This outlier-free dataset is then used to learn a kernel 
ridge regression with Gaussian kernel. This ground-truth 
data trained model is denoted as GT. 
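The error synthesis procedure described above can be sketched as follows. Everything numerical here is a placeholder: the quadratic coefficients are hypothetical (the fitted values are not given in the text), the clipping range is an assumption, and the function names are illustrative. Each comparison is stored as `(i, j, is_error)`, meaning item `i` was labelled as older than item `j`.

```python
import random

def error_probability(age_diff, coeffs=(0.4, -0.01, 0.00005)):
    """Quadratic model of the chance of an unintentional error.
    The coefficients are hypothetical placeholders, not fitted values."""
    a, b, c = coeffs
    p = a + b * age_diff + c * age_diff ** 2
    return min(max(p, 0.0), 0.5)

def _draw_pair(ages, rng):
    """Sample two items with different ages (assumes at least two distinct ages)."""
    while True:
        i, j = rng.sample(range(len(ages)), 2)
        if ages[i] != ages[j]:
            return i, j

def make_comparisons(ages, n_pairs, n_random, rng, err=error_probability):
    """Generate 'Unint.' comparisons, then append random ones ('Unint.+Int.')."""
    out = []
    for _ in range(n_pairs):
        i, j = _draw_pair(ages, rng)
        older, younger = (i, j) if ages[i] > ages[j] else (j, i)
        if rng.random() < err(abs(ages[i] - ages[j])):
            out.append((younger, older, True))    # unintentional error
        else:
            out.append((older, younger, False))   # correct label
    for _ in range(n_random):
        i, j = _draw_pair(ages, rng)              # random direction from a 'bad' worker
        out.append((i, j, ages[i] < ages[j]))
    return out
```

Because the ground-truth ages are known, the `is_error` flag gives the exact outlier set, which is what allows outlier detection accuracy to be measured directly in this setting.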

Quantitative results Four experiments were conducted using different settings to show the effectiveness of our URLR method quantitatively.

(1) URLR vs. Huber-LASSO-FL. In this experiment, 300 training images and 600 unique comparisons were randomly sampled from the training set. Figure 9 shows that URLR and Huber-LASSO-FL improve over Raw, indicating that outliers are effectively pruned using both global outlier detection methods. Both methods are robust to a low error rate (Figure 9 Left: 10% in Unint.) and are fairly close to GT, whilst the performance of URLR is significantly better than that of Huber-LASSO-FL given a high error ratio (Figure 9 Right: 25% in Unint.+Int.) because using the low-level feature representation increases the dimension of the projection space for $\gamma$ from 301 for Huber-LASSO-FL to 546 for URLR (see Section 3.4.2). This result again validates our analysis that a higher $\dim(\Gamma)$ leads to a better chance of identifying outliers correctly. It is noted that in Figure 9 (Right), given 25% outliers, the result indeed peaks when $p$ is around 25; importantly, it stays flat when up to 50% of the annotations are pruned.

(2) Comparison with Maj-Vot-1. Given the same data but 
each pair compared by 5 workers (instead of 1) under the 



Figure 11. Effect of error ratio. Left: outlier detection 
performance measured by area under ROC curve (AUC). 
Right: rank prediction performance measured by rank 
correlation. 



Figure 12. Relationship between the pruning order and 
actual age difference for URLR. 


Unint.+Int. error condition. Figure 10 shows that Maj-Vot-1 beats Raw. This shows that for a relatively dense graph, majority voting is still a good strategy for removing some outliers and improving the prediction accuracy. However, URLR outperforms Maj-Vot-1 after the pruning rate passes 10%. This demonstrates that aggregating all paired comparisons globally for outlier pruning is more effective than aggregating them locally for each edge as done by majority voting.

(3) Effects of error ratio. We used the Unint.+Int. error
model to vary the amount of random comparisons
and simulate different amounts of errors in 10 graphs
sampled from 300 training images, each with 2000 unique
pairs sampled from the training images. The pruning
rate was fixed at 25%. Figure 11 shows that URLR
remains effective even when the true error ratio reaches
as high as 35%. This demonstrates that although a sparse
outlier model is assumed, our model can deal with non-
sparse outliers. It also shows that URLR consistently
outperforms the alternative models, especially when the
error/outlier ratio is high.

What is pruned and in what order? The effectiveness
of the employed regularisation path method for outlier
detection can be examined by noting that, as $\lambda$
decreases, it produces a ranked list of all pairwise
comparisons according to their outlier probability.
Figure 12 shows the relationship between the pruning
order (i.e. which pair is pruned first) and the ground
truth age difference, illustrated by examples. It can be
seen that, overall, outliers with a larger age difference
tend to be pruned first. This means that even with a
conservative pruning rate, obvious outliers (potentially
causing more performance degradation in learning) can be
reliably pruned by our model.

5 Conclusions and Future Work 

We have proposed a novel unified robust learning to 
rank (URLR) framework for predicting subjective visual 
properties from images and videos. The key advantage 
of our method over the existing majority voting based 
approaches is that we can detect outliers globally by 
minimising a global ranking inconsistency cost. The 
joint outlier detection and feature based rank prediction
formulation also provides our model with an advantage
over the conventional robust ranking methods that do not
use features for outlier detection: it can be applied when
there is a large number of candidates to compare but
only a sparse sampling of crowdsourced comparisons. The
effectiveness of our model in comparison with state-of-the-art
alternatives has been validated on the tasks of image and
video interestingness prediction and predicting relative
attributes for visual recognition. Its effectiveness for
outlier detection has also been evaluated in depth in the
human age estimation experiments.

By definition subjective visual properties (SVPs) are 
person-dependent. When our model is learned using 
pairwise labels collected from many people, we are 
essentially learning consensus - given a new data point 
the model aims to predict its SVP value that can be 
agreed upon by most people. However, the predicted 
consensual SVP value could be meaningless for a specific 
person when his/her taste/understanding of the SVP is 
completely different to that of most others. How to learn 
a person-specific SVP prediction model is thus part of 
the on-going work. Note that our model is only one of
the possible solutions to inferring global ranking from
pairwise comparisons. Other models exist. In particular,
one widely studied alternative is the Bradley-Terry-Luce
(BTL) model [60], [61], [62], which aggregates the
ranking scores of pairwise comparisons to infer a global
ranking by maximum likelihood estimation. The BTL
model is introduced to describe the probabilities of the
possible outcomes when individuals are judged against
one another in pairs [60]. It is primarily designed to
incorporate contextual information in the global ranking
model. We found that directly applying the BTL
model to our SVP prediction task leads to much inferior
performance because it does not explicitly detect and
remove outliers. However, it is possible to integrate it
into our framework to make it more robust against
outliers and sparse labels whilst preserving its ability
to take advantage of contextual information. Other new
directions include extending the presented work to other
applications where noisy pairwise labels exist, both in
vision, such as image denoising [63] and iterative search and
active learning of visual categories [30], and in other
fields such as statistics and economics [?].

Supplementary Material

We thank the anonymous reviewers of our TPAMI submission
for their excellent questions. In answering them, we
found some details and insights of our framework that
had previously been overlooked. Due to the page limit
of the journal version, we use this document to explain
these details and insights further and to help our
readers better understand our work.

1) Further, the proposed approach doesn't seem to
truly get to the bottom of why subjective properties
are tricky, namely that two people might actually
have a different understanding of the property.
While the authors do refer to such possible
disagreements in the introduction, the proposed
method doesn't seem to consider this possibility. In
other words, how does it make sense to consider
a single global order when such an order might be
unattainable since person A's "interestingness" will
differ from person B's?

This is a very good question. Indeed, since the properties 
are subjective, they are by definition person-dependent. 
However, in most applications when we learn a SVP 
prediction model using pairwise labels collected from 
many different annotators, we are modeling consensus. 
In other words, the model essentially aggregates the 
understandings of different people regarding a certain 
SVP so that the predicted SVP for an unseen data point 
can be agreed upon by most people. For example, in 
the case of video interestingness, YouTube may want to 
predict the interestingness of a newly uploaded video 
so as to decide whether or not to promote it. Such a 
prediction obviously needs to be based on consensus from 
the majority of the YouTube viewers regarding what 
defines interestingness. However, collecting consensus 
can be expensive; the proposed model in this paper thus 
aims to infer the consensus from as few labels as possible. 

It is also true that for a specific person, he/she would 
prefer a SVP prediction model that is tailor-made for 
his/her own understanding of the SVP, i.e. a person- 
specific prediction model. Such a model needs to be 
learned using his/her pairwise labels only. For example, 
YouTube could recommend different videos for different 
registered users when they log in, if they provide some 
pairwise video interestingness labels for learning such 
a model (at present, this is done based on some simple 
rules from the viewing history of the user). This also has 
its own problem - it is much harder to collect enough 
labels from a single person only to learn the prediction 
model. There are solutions, e.g. categorising the users 
into different groups so that the labels from people of 
the same group can be shared. However this is beyond 
the scope of this paper and is being considered as part 
of ongoing work. 

We have provided a discussion of this problem in Section
5 of the revised manuscript (Page 14).

2) It feels a little bit unsatisfying that the method
requires we pick a fixed ratio of outliers. This
would be more ok if the ratio can be automatically
computed from the data somehow.

Indeed, the pruning rate is a free parameter of the
proposed model (in fact, the only free parameter) that has
to be set manually. As discussed in the beginning of Section
3.3, most existing outlier detection algorithms have a
similar free parameter determining how aggressive the
algorithm needs to be in pruning outliers. Automated
model selection criteria such as BIC and AIC could
be considered. However, as pointed out by [49], they
are often unstable for the outlier detection problem with
pairwise labels. We have carried out experiments which show
that when BIC or AIC is employed, the selected model
fails to detect meaningful outliers. Since a related
comment is given by Reviewer 3, please refer to
Response Point 2 to Reviewer 3 for detailed experimental
results and analysis of the alternative outlier detection
methods including BIC. It is also worth pointing out
that our results on the effect of the pruning rate show
that the proposed model remains effective over a wide
range of pruning rate values (see Fig. 3, 5, 9 and 10).
We have now added a footnote in Section 3.3 to discuss
why an automated model selection criterion such as BIC
is not adopted.

3) I think cases of Raw performing similarly or better
than MajVot1/2 should be explained in a little
more detail, i.e. an intuition for such outcomes
should be given.

Thanks for the suggestion. Indeed, our results on both
image and video interestingness experiments show that
Raw performs similarly to majority voting. There is
an intuitive explanation for this. When a pair of data
points A and B receives multiple votes/labels with different
pairwise orders/ranks, majority voting converts these
labels into a single label corresponding to the order
that receives the most votes. Since only one of the
two orders is correct (either A>B or B>A), there are
two possibilities: the majority voted label is correct, or
it is incorrect, i.e. an outlier. In comparison, using Raw,
all votes count: the outlying votes certainly have
a negative effect on the learned prediction model,
but the correct votes/labels still contribute as well.
Now let us consider which method is better. The answer
is that it depends on the outlier/error ratio of the labels.
If the ratio is very low, majority voting will get rid of
almost all the outlying votes; MajVot would thus be
advantageous over Raw, which still feels the negative
effects of the outliers. However, when the ratio gets
bigger, it becomes possible that the outlying label
becomes the winning vote. For example, suppose A>B is
correct and receives 2 votes, whilst A<B is incorrect and
receives 3 outlying votes. Using Raw, those 2 correct
votes still contribute positively to the model, whilst
using MajVot, their contribution disappears and the
negative impact of the outlying votes is amplified.
Therefore, one expects that as MajVot makes more and
more mistakes, its performance will get closer to that
of Raw, until it reaches a tipping point where Raw
starts to get ahead.
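
The 2-versus-3 example above can be made concrete with a minimal Python sketch. The function names `majority_vote` and `raw_labels` are hypothetical and used only for illustration; labels are encoded as +1 for A>B and -1 for A<B, and the tie-breaking rule is an arbitrary assumption.

```python
from collections import Counter


def majority_vote(votes):
    """Collapse all votes for one pair into a single label.

    votes: list of +1 (A>B) or -1 (A<B).
    Ties are broken towards +1, an arbitrary choice for this sketch.
    """
    counts = Counter(votes)
    return 1 if counts[1] >= counts[-1] else -1


def raw_labels(votes):
    """Raw keeps every vote as its own training label."""
    return list(votes)


# Ground truth is A>B (+1): 2 correct votes and 3 outlying votes.
truth = 1
votes = [1, 1, -1, -1, -1]

mv_label = majority_vote(votes)
raw = raw_labels(votes)

# Majority voting keeps only the winning (here: wrong) label.
print("MajVot label correct:", mv_label == truth)
# Raw keeps the two correct votes alongside the three outliers.
print("Correct labels kept by Raw:", sum(v == truth for v in raw))
```

In this toy case majority voting discards both correct votes, whereas Raw passes them on to the learner together with the outliers, which is the situation in which Raw can start to get ahead.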

We have added a brief discussion of this in Section 4.1
on Page 9.

4) In Figure 3, why does the Kendall tau distance start 
to increase as the pruning rate increases, after 55% 
for URLR? 

A higher Kendall tau distance means worse prediction.
Figure 3 (right) thus shows that the performance of URLR
improves as more and more outliers are pruned in the
beginning; then, after more than 55% of pairs are
pruned, its performance starts to decrease. This
result is expected: at low pruning rates, most of the
pruned pairs are outliers, so the model benefits.
Since the percentage of outliers is almost certainly
lower than 50%, by the time the pruning rate reaches
55% most of the outliers have been removed, and the
algorithm starts to remove correctly labelled pairs.
With fewer and fewer correct labels available to learn the
model, the performance naturally decreases; when the
pruning rate gets close to 100%, it is no longer possible
to learn a meaningful model, and the Kendall tau distance
thus shoots up.
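
For readers unfamiliar with the metric, the following is a minimal Python sketch of a normalised Kendall tau distance between a predicted and a reference ranking. It assumes the common definition as the fraction of discordantly ordered pairs; the exact normalisation and tie handling used in our experiments are not restated here, and the function name is hypothetical.

```python
from itertools import combinations


def kendall_tau_distance(scores_a, scores_b):
    """Fraction of item pairs that the two score lists order differently.

    0.0 means identical orderings and 1.0 means exactly reversed.
    Pairs tied in either list are skipped (an assumption of this sketch).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must have the same length")
    discordant = 0
    compared = 0
    for i, j in combinations(range(len(scores_a)), 2):
        diff_a = scores_a[i] - scores_a[j]
        diff_b = scores_b[i] - scores_b[j]
        if diff_a == 0 or diff_b == 0:
            continue
        compared += 1
        if diff_a * diff_b < 0:
            discordant += 1
    return discordant / compared if compared else 0.0


reference = [1, 2, 3, 4]
print(kendall_tau_distance(reference, [1, 2, 3, 4]))  # identical order
print(kendall_tau_distance(reference, [1, 2, 4, 3]))  # one swapped pair
print(kendall_tau_distance(reference, [4, 3, 2, 1]))  # reversed order
```

A lower value therefore indicates a predicted ranking that agrees with the reference on more pairs.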

We have now added a sentence on Page 10 to give an 
explanation to this phenomenon. 

5) Page 2 line 44 "For example, Figure 1 ..." the
authors try to argue that examples shown in Figure 1
are outliers. I don't quite agree. Authors
are trying to study subjective attributes. These
are good examples of subjective versus objective
attributes. This doesn't seem to be about outliers
vs. not. In fact, one source of outliers other than
malicious workers is global (in)consistency, which
is not mentioned here. The authors could draw
from the concrete example of Figure 2.

This is a very good point. It is certainly worthwhile
to clarify the definition of an outlier in the context of
a subjective visual property (SVP). In particular, since
by definition an SVP is subjective, defining an outlier,
or even attempting to predict an SVP, may appear
self-contradictory: one man's meat is another man's
poison. However, there is certainly a need for learning
an SVP prediction model, hence this paper. This is
because when we learn the model from labels collected
from many people, we essentially aim to learn the
consensus, i.e. what most people would agree on (please
see our Response Point 1 for more discussion on this).
Therefore, Figure 1(a) can still be used to illustrate
the outlier issue in SVP annotation: most of the
annotators may have grown up watching Sesame Street and
thus consciously or subconsciously consider the Cookie
Monster to be more interesting than the Monkey King;
their pairwise labels/votes thus represent the consensus.
In contrast, an annotator who is familiar with the
stories in Journey to the West may choose the opposite;
his/her label is thus an outlier under the consensus.
We have reworded the relevant text on Page 2 to avoid
confusion.

6) A baseline to compare to might be to feed all
individual constraints (without majority vote) to a
rankSVM. SVMs already allow for some slack. So
I would be curious to know if that takes care of
some of the outliers already.

Thanks. In fact, we do have one set of results on this.
Specifically, in Sec 4.2 "Video interestingness prediction",
as explained under "Experimental settings", we
employed a rankSVM model to replace Eq (9). Therefore,
the model denoted as 'Raw' in this experiment is exactly
the suggested baseline of feeding all constraints to a
rankSVM. As shown in Fig. 5(a), the model is on par
with Maj-Vot-1 but worse than the two global outlier
detection methods, Huber-LASSO-FL and our URLR.
This result suggests that rankSVM does have some
ability to cope with outliers. However, we are not sure
that this is due to the slack variables of rankSVM,
because the slack variables are introduced to account for
data noise [?], which is different from the outliers in the
pairwise data.

7) "For the scene and pubfig image dataset, the relative
attribute prediction performance can only be
evaluated indirectly by image classification accuracy
with the predicted relative attributes as image
representation." > Why is that? Can't you compute
attribute prediction performance on a held out set
of annotated pairs? Or is the concern that since
the pairs may be noisily annotated, one can not
think of them as GT? But is that not an issue with
interestingness then? Please clarify in rebuttal.

Thanks for this question. We stated in footnote 9 that
"Collecting ground truth for subjective visual properties
is always problematic. Recent statistical theories [61],
[19] suggest that the dense human annotations can give
a reasonable approximation of ground truth for pairwise
ranking. This is how the ground truth pairwise rankings
provided in [4] and [5] were collected." So for image and
video interestingness as well as the age dataset,
sufficiently dense pairwise comparisons are available to give a
reasonable approximation of the ground truth. However,
this is not the case for the scene and pubfig image datasets:
the collected pairs are much more sparse and cannot be
used as an approximation to the ground truth. In short,
it is because they are too sparse rather than too noisy.
In contrast, the indirect evaluation metric of downstream
classification accuracy has clear, unambiguous
ground truth and directly depends on relative attribute
prediction accuracy. So this evaluation is preferred.

8) Related Work: The Bradley-Terry-Luce (BTL) model
is the standard model for computing a global ranking
from pairwise labels. It should be mentioned in
the related work. See [52] or Hunter, D. R. (2004).
MM algorithms for generalized Bradley-Terry models.
Annals of Statistics. Experiments: I would
expect additional comparisons to state-of-the-art
(BTL or SVM-rank aggregation [52]). In particular
the Bradley-Terry-Luce (BTL) model is extremely
widely used and more robust to noise than LASSO
based approaches [52]. E.g. "Generalized Method-
of-Moments for Rank Aggregation" or "Efficient
Bayesian Inference for Generalized Bradley-Terry
Models" provide code for inference in BTL models.
Such a method leads to a global ranking, which
could be used to train an SVM. Alternatively, it
can be used to find pairwise rankings that disagree
with the obtained global ranking. These could be
removed as outliers and a rank-SVM trained from
the remaining pairwise labels. Such an experiment
should be included as an additional state-of-
the-art comparison in the updated version of the
manuscript.

Thanks for the suggestion. Indeed, the Bradley-Terry-
Luce (BTL) model is a very relevant global ranking
model. We have now studied it carefully and made
connections to the proposed URLR model. We also carried
out new experiments to evaluate the BTL model for our
Subjective Visual Property (SVP) prediction task.

More specifically, the BTL model is a probabilistic model
that aggregates the ranking scores of pairwise comparisons
to infer a global ranking by maximum likelihood
estimation. It is closely related to the proposed global
ranking model, yet it also has some vital differences.
Let us first look at the connection. The main pairwise
ranking model of Huber-LASSO used in this paper is a
linear model (see Eq (10) and Eq (12)), which is

$$Y_{ij} = \theta_i - \theta_j + \gamma_{ij} + \varepsilon_{ij} \qquad (16)$$

In statistics and psychology [19], [64], [15], [?], such
a linear model can be extended to a family of generalised
linear models when only binary comparisons are
available for each pair, i.e. either i is preferred to
j or vice versa. In these generalised linear models, one
assumes that the probability of pairwise preference is
fully determined by a linear ranking/rating function as
follows,

$$\pi_{ij} = \mathrm{Prob}\{i \text{ is preferred over } j\} = \Phi(\theta_i - \theta_j)$$

where $\Phi : \mathbb{R} \to [0,1]$ can be chosen as any symmetric
cumulative distribution function.

Different choices of $\Phi$ lead to different generalised linear
models. In particular, two choices are worth mentioning
here:

• Uniform model,

$$Y_{ij} = 2\pi_{ij} - 1 \qquad (17)$$

This model is equivalent to using $y_{ij} = 1$ if i is
preferred to j and $y_{ij} = -1$ otherwise in the linear
model. This model is used in this work to derive
our URLR model.

• Bradley-Terry-Luce (BTL) model,

$$Y_{ij} = \log \frac{\pi_{ij}}{1 - \pi_{ij}} \qquad (18)$$

Figure 13. Comparing the BTL model with our model on
image interestingness prediction (panel title: Image
Interestingness).


So by now, it is clear that both our URLR and BTL
generalise the linear model in Huber-LASSO. They differ
in the choice of the symmetric cumulative distribution
function $\Phi$.
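
The role of $\Phi$ can be illustrated with a small Python sketch. It assumes that the uniform model corresponds to the CDF of a uniform distribution on [-1, 1] (which is what Eq (17) implies when $Y_{ij} = \theta_i - \theta_j$) and that the BTL model corresponds to the logistic CDF; the function names and the latent scores are hypothetical. In both cases the link in Eq (17) or Eq (18) simply inverts the chosen CDF.

```python
import math


def uniform_cdf(x):
    """CDF of the uniform distribution on [-1, 1], clipped outside it."""
    return min(1.0, max(0.0, (x + 1.0) / 2.0))


def logistic_cdf(x):
    """CDF of the standard logistic distribution."""
    return 1.0 / (1.0 + math.exp(-x))


def uniform_link(pi):
    """Eq (17): Y = 2*pi - 1, the inverse of uniform_cdf on [0, 1]."""
    return 2.0 * pi - 1.0


def btl_link(pi):
    """Eq (18): Y = log(pi / (1 - pi)), the inverse of logistic_cdf."""
    return math.log(pi / (1.0 - pi))


# Hypothetical latent scores for two items i and j.
theta_i, theta_j = 0.7, 0.2
diff = theta_i - theta_j

for cdf, link in [(uniform_cdf, uniform_link), (logistic_cdf, btl_link)]:
    pi = cdf(diff)        # probability that i is preferred over j
    recovered = link(pi)  # applying the link recovers theta_i - theta_j
    print(cdf.__name__, round(pi, 4), round(recovered, 4))
```

Both models therefore share the same linear predictor and differ only in how a preference probability is mapped to it.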

Although both of them are generalised from the same
linear model, they are developed for very different
purposes. The BTL model is introduced to describe the
probabilities of the possible outcomes when individuals
are judged against one another in pairs [60], [?]. It is
primarily designed to incorporate contextual information
in the global ranking model. For instance, in sports
applications, it can be used to account for the home-field
advantage and tie situations [64], [62]. In contrast, our
framework tries to detect outliers in the pairwise
comparisons and cope with sparse labels. Consequently,
from Eq (1) onwards, we introduce the outlier variable
to model the outliers explicitly and introduce the low-level
feature variable to enhance our model's ability to detect
outliers given sparse labels. Neither of these is in the BTL
model, which means that it may not be suitable given
sparse pairwise comparisons with outliers.

To verify this, we took the suggestion by Reviewer 3
and employed the Matlab code from the website of [?],
"Efficient Bayesian Inference for Generalized Bradley-
Terry Models", to carry out experiments. The results on
image interestingness prediction are compared in Fig. 13.
It shows that the performance of BTL is much worse
than that of the other alternatives. Similar results were
obtained on video interestingness prediction and age
estimation. As explained above, it is actually not fair to
compare the BTL model with the other models because BTL
was not designed for outlier detection and could not cope
with the amount of outliers and the level of sparseness in
our SVP data. We therefore decided not to include the new
results in the revised manuscript. However, from our
analysis above, it is also clear that we could use the BTL
model (Eq (18)) to generalise the linear model in place
of the uniform model, and use it in our outlier detection
framework. In this way, we can have the best of both
worlds: the ability of BTL to incorporate contextual
information such as the home-field advantage in sports
can also be taken advantage of in our framework whilst
preserving our model's strength in robustness against
outliers and sparse labels. However, this is probably
beyond the scope of this paper and is better left to
future work. In the revised manuscript, we have now
added the following paragraph in Section 5, where we
discuss that BTL is an alternative model that can be
integrated into our framework as part of the future work.
"Note that our model is only one of the possible
solutions to inferring global ranking from pairwise
comparisons. In particular, one widely studied
alternative is the Bradley-Terry-Luce (BTL) model
[61,62,63], which aggregates the ranking scores of
pairwise comparisons to infer a global ranking by
maximum likelihood estimation. The BTL model is
introduced to describe the probabilities of the possible
outcomes when individuals are judged against
one another in pairs [61]. It is primarily designed
to incorporate contextual information in the global
ranking model. We found that directly applying
the BTL model to our SVP prediction task leads
to much inferior performance because it does not
explicitly detect and remove outliers. However, it is
possible to integrate it into our framework to make
it more robust against outliers and sparse labels
whilst preserving its ability to take advantage of
contextual information."

9) 3.3 Regularization path. On the one hand the authors
say that "Setting a constant $\lambda$ value independent
of dataset is far from optimal because the ratio
of outliers may vary for different crowdsourced
datasets", but using the regularization path this is
exactly what is done in the end. It is true that
the experiments show that the proposed method is
fairly robust w.r.t. the outlier ratio. Nonetheless, I
would like to see an experiment using a (modified)
BIC for selecting the outlier ratio. This would be a
valuable extension over the ECCV work.

Thanks. As discussed in the beginning of Section 3.3,
most existing outlier detection algorithms have a free
parameter similar to $\lambda$ that determines how aggressive
the algorithm needs to be in pruning outliers. Automated
model selection criteria such as BIC and AIC could
be considered. However, as pointed out by [49], they
are often unstable for the outlier detection problem with
pairwise labels.

We have evaluated alternative methods including the
modified BIC and AIC for image and video interestingness
prediction. The results suggest that automated
criteria such as AIC and BIC fail to identify any
outliers: they prefer the model that includes all input
pairwise comparisons. To find out why this is the case,
we carried out a controlled experiment using synthetic
data to investigate how different factors affect the
performance of different methods for determining the outlier
ratio. Specifically, we compare IPOD hard-threshold, BIC and
orthogonal matching pursuit with our Regularization Path model.

Experimental design. We use a complete graph G with
30 nodes. Our framework is simplified into the following
ranking model,

$$Y_{ij} = \theta_i - \theta_j + \gamma_{ij} + \varepsilon_{ij}$$

Let $\theta \sim U(-1, 1)$, $\varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)$ and $\gamma_{ij} = \pm L$. We
simulate the outlier pairs by random sampling, that
is, each pair's true ranking is reversed (i.e. becoming an
outlier/error) with a probability p, which determines
the outlier ratio. The magnitude of the outliers in relation to
that of the noise is another factor which could potentially
affect the performance of different methods on outlier
detection. So we define the outlier-noise-ratio ONR :=
$L/\sigma$, where $\sigma = 0.1$ in our experiment and L is varied
to give different ONR values.
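
The synthetic setting above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `simulate_comparisons` is hypothetical, and the sign of $\gamma_{ij}$ for an outlier is assumed here to oppose the true order $\theta_i - \theta_j$, since the text only specifies its magnitude L.

```python
import random
from itertools import combinations


def simulate_comparisons(n_nodes=30, p=0.2, onr=6.0, sigma=0.1, seed=0):
    """Simulate Y_ij = theta_i - theta_j + gamma_ij + eps_ij on a complete graph.

    Returns (theta, edges), where each edge is a tuple (i, j, y, gamma)
    and gamma is non-zero only for simulated outliers.
    """
    rng = random.Random(seed)
    magnitude = onr * sigma  # L, from ONR := L / sigma
    theta = [rng.uniform(-1.0, 1.0) for _ in range(n_nodes)]
    edges = []
    for i, j in combinations(range(n_nodes), 2):
        clean = theta[i] - theta[j]
        gamma = 0.0
        if rng.random() < p:
            # ASSUMPTION: the outlier term pushes against the true order.
            gamma = -magnitude if clean > 0 else magnitude
        y = clean + gamma + rng.gauss(0.0, sigma)
        edges.append((i, j, y, gamma))
    return theta, edges


theta, edges = simulate_comparisons()
n_outliers = sum(1 for _, _, _, gamma in edges if gamma != 0.0)
print(len(edges), "edges,", n_outliers, "simulated outliers")
```

A complete graph with 30 nodes has 435 edges, so each setting of p and ONR yields 435 simulated comparisons.
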

Evaluation protocols and results. We first compare
three methods that require the manual setting of a
free parameter corresponding to the outlier ratio. These
include our formulation (Eq (8)) with Regularization
Path (i.e. the proposed model), IPOD hard-threshold
[?] with Regularization Path, and our formulation
with orthogonal matching pursuit [65]. Using our model
with Regularization Path, $\lambda$ is decreased from $\infty$ to
0 and the graph edges are ordered according to how
likely each is to correspond to an outlier. The top p% of
edges, denoted $\Lambda_p$, are detected as outliers. By varying p, the ROC
(receiver-operating-characteristic) curve can be plotted
and the AUC (area under the curve) computed. Similarly,
IPOD hard-threshold can also be solved using
the same Regularization Path strategy, and orthogonal
matching pursuit can be used to solve our formulation
for outlier detection in place of Regularization Path. As
shown in Figure 14, the results of our formulation with
Regularization Path are consistently better than those
of IPOD hard-threshold + Regularization Path and our
formulation + orthogonal matching pursuit. Specifically,
it shows that (1) when there is a small portion of
outliers, all the methods can reliably prune most of the
outliers; (2) in all experiments, IPOD hard-threshold
and orthogonal matching pursuit have similar performance,
whilst our formulation + Regularization Path is
consistently better than the other alternatives, especially
when there is a large portion of outliers (high values of
p); (3) the higher the ONR, the better the outlier
detection performance of all three methods.
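
The evaluation protocol can be sketched as follows: given any ordering of the edges from most to least likely outlier (for example the order in which edges are selected along the regularization path), sweeping the cut-off traces the ROC curve, and the AUC follows by the trapezoidal rule. The function name `roc_auc_from_order` and the toy data are hypothetical; the sketch does not compute the regularization path itself.

```python
def roc_auc_from_order(order, is_outlier):
    """ROC points and AUC for an outlier ranking.

    order:      edge indices sorted from most to least likely outlier.
    is_outlier: list of booleans with the ground truth for each edge.
    Flagging the top-k edges for k = 0..len(order) gives one ROC point each.
    """
    n_pos = sum(1 for flag in is_outlier if flag)
    n_neg = len(is_outlier) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one outlier and one inlier")
    tp = fp = 0
    points = [(0.0, 0.0)]  # (FPR, TPR) when nothing is flagged
    for idx in order:
        if is_outlier[idx]:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    auc = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        auc += (x1 - x0) * (y0 + y1) / 2.0  # trapezoidal rule
    return auc, points


# Toy ground truth: edges 0 and 2 are outliers.
truth = [True, False, True, False]
print(roc_auc_from_order([0, 2, 1, 3], truth)[0])  # outliers ranked first
print(roc_auc_from_order([0, 1, 2, 3], truth)[0])  # partially correct ranking
print(roc_auc_from_order([1, 3, 0, 2], truth)[0])  # outliers ranked last
```

An AUC close to 1 thus means that true outliers are consistently placed ahead of inliers in the pruning order, independently of the particular pruning rate chosen.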

In contrast, BIC utilises the relative quality and likelihood
functions of the statistical models themselves to determine
a fixed $\lambda$. Therefore, the true positive rate (TPR)
and false positive rate (FPR) for BIC are reported. The
results are listed in Table 2. It shows that, when using
our formulation with BIC, most of the outliers can be
reliably pruned only when there is a very small portion
of outliers and the outlier-noise-ratio is extremely high.
Otherwise, BIC tends to consider all pairs inliers. As
mentioned above, using BIC in place of Regularization
Path also leads to no outliers being pruned in our
SVP prediction experiments. This suggests that
the real outlier ratio (roughly corresponding to p=0.2, see
Response Point 10 to Reviewer 2) is too high and/or the
outlier-noise-ratio (ONR) is too low for BIC to work.

14. Strictly speaking, IPOD hard-threshold is not a Lasso solver, since it
replaces soft-thresholding with hard-thresholding. However, for convenience
of comparison, we still compare it with our RP.

Due to the space constraint, we could not include all 
these results and analysis in the revised manuscript. On 
Page 6, we have now added a footnote (Footnote 3) to 
refer the readers to find additional results and discussion 
on this outlier ratio problem in the project webpage at 
http://www.eecs.qmul.ac.uk/-yf300/ranking/index.html. 

10) Page 9, Col. 2, Line 52: The authors talk about
global image features (GIST), but Page 8, Line 45
indicates that the ground truth annotations such as
"central object", etc. were used. Using the complete
ground truth annotation seems to be problematic,
as it also contains an attribute "is interesting"
and others such as "is aesthetic" and "is unusual".
When using this ground truth, I believe such labels
should be excluded and only content attributes
used (such as: indoor/outdoor, contains a person,
etc.).

Thanks for the suggestion. We have updated this experiment
as suggested. Specifically, we first examined
how each of the 932 attribute features is correlated with
the ground truth interestingness value of each image. Figure
15 shows that (1) only a small number of these attribute
features have a strong correlation with the interestingness
value, and (2) the histogram of the Kendall tau correlation
(see footnote 15) of all features is roughly Gaussian, as
shown in Fig. 15 (right).

So as suggested, for a fairer comparison, we remove
the attribute features [?] whose Kendall tau correlations
are higher than 0.4 or lower than -0.4. This
leads to the deletion of the features listed in Table 3. These
pruned features include those suggested by Reviewer
3 ("is_interesting" and "is_aesthetic"), but not the
"unusual" attribute feature, which has a low correlation
value of -0.0226.
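
The pruning rule can be sketched as below. The sketch assumes the tau-a variant of the Kendall tau correlation (the variant actually used is not specified above), and the feature values and interestingness scores are made-up toy numbers; only the 0.4 threshold comes from the text. The names `kendall_tau` and `prune_features` are hypothetical.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) / total number of pairs."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equally long lists with at least 2 items")
    score = 0
    for i, j in combinations(range(n), 2):
        prod = (x[i] - x[j]) * (y[i] - y[j])
        if prod > 0:
            score += 1
        elif prod < 0:
            score -= 1
    return score / (n * (n - 1) / 2)


def prune_features(features, target, threshold=0.4):
    """Split features into kept and pruned by |tau| against the target.

    features: dict mapping feature name to a list of per-image values.
    Returns two dicts mapping feature name to its tau value.
    """
    kept, pruned = {}, {}
    for name, values in features.items():
        tau = kendall_tau(values, target)
        (pruned if abs(tau) > threshold else kept)[name] = tau
    return kept, pruned


# Toy data for five images (hypothetical values, not from the dataset).
interestingness = [0.1, 0.4, 0.5, 0.9, 0.7]
features = {
    "is_interesting": [0, 0, 1, 1, 1],
    "unusual": [1, 0, 1, 0, 1],
}
kept, pruned = prune_features(features, interestingness)
print("kept:", kept)
print("pruned:", pruned)
```

In this toy example the feature that closely tracks the target is pruned, whereas the weakly correlated one is kept, mirroring the treatment of "is_interesting" and "unusual" above.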

We repeat the image interestingness experiments with 
the updated features. It is noticed that this has little 
effect on the results (still within the variances). 


15. Note that here, we employ the Kendall tau correlation rather than the
Spearman correlation (the Spearman correlation of "is interesting" vs. ground
truth is 0.63 as reported in [4]), since the Spearman correlation is much more
sensitive to errors and discrepancies in data, and the Kendall tau correlation
[66] generally has better statistical properties.

Figure 14 (panels: ONR=4, ONR=5, ONR=6, ONR=7, ONR=8, ONR=9). Effects of outlier/error probability (p) and outlier-noise ratio (ONR) on our formulation + Regularization
Path (denoted as Regularization Path), IPOD hard-threshold + Regularization Path and our formulation + orthogonal
matching pursuit.



|       | ONR=4   | ONR=5       | ONR=6     | ONR=7    | ONR=8   | ONR=9    |
|-------|---------|-------------|-----------|----------|---------|----------|
| p=0.1 | 0.002/0 | 0.494/0.012 | 1/0.003   | 1/0.026  | 1/0.025 | 1/0.031  |
| p=0.2 | 0/0     | 0/0         | 0.3/0.016 | 0.9/0.05 | 1/0.064 | 1/0.037  |
| p=0.3 | 0/0     | 0/0         | 0/0       | 0/0      | 0/0     | 0.5/0.06 |
| p=0.4 | 0/0     | 0/0         | 0/0       | 0/0      | 0/0     | 0/0      |
| p=0.5 | 0/0     | 0/0         | 0/0       | 0/0      | 0/0     | 0/0      |
| p=0.6 | 0/0     | 0/0         | 0/0       | 0/0      | 0/0     | 0/0      |

Table 2

The outlier detection results of our formulation + BIC. The results are presented as TPR/FPR. The error probability
and ONR are $p \in [0.1, 0.6]$ and $\mathrm{ONR} \in [4, 9]$ respectively.


Figure 15. Kendall tau correlations of each feature dimension with the ground truth interestingness value. (Left) X-axis:
each dimension; Y-axis: correlation values. (Right) Histogram of the correlations for different features.


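| attribute      | Kendall tau correlation |
|----------------|-------------------------|
| pleasant scene | -0.4060                 |
| attractive     | -0.4273                 |
| memorable      | -0.4618                 |
| is_aesthetic   | 0.4487                  |
| is_interesting | 0.4715                  |
| on post-card   | 0.4767                  |
| buy painting   | 0.4085                  |
| hang on wall   | 0.4209                  |

Table 3

The pruned attribute features.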
























































